Abstract 

The structural cohesion model is a powerful theoretical conception of cohesion in social 
groups, but its diffusion in empirical literature has been hampered by operationalization and 
computational problems. In this paper we start from the classic definition of structural 
cohesion as the minimum number of actors who need to be removed from a network in order to 
disconnect it, and extend it by using average node connectivity as a finer-grained measure 
of cohesion. We present useful heuristics for computing structural cohesion that allow a 
speed-up of one order of magnitude over the algorithms currently available. We analyze 
three large collaboration networks (co-maintenance of Debian packages, co-authorship in 
Nuclear Theory and High-Energy Theory) and show how our approach can help researchers 
measure structural cohesion in relatively large networks. We also introduce a novel graphical 
representation of the structural cohesion analysis to quickly spot differences across 
networks. 

Group cohesion is a central concept that has a long and illustrious history in sociology and 
organization theory, although its precise characterization has remained elusive. Its use in most 
sociological research has been ambiguous at best. This is largely because, as Moody and White 
(2003) argued, it is often based on sloppy operationalization grounded mostly in intuition and 
common sense. Network analysis has provided a large number of solutions to this problem. 
From classical work in the graph-theoretic sociological tradition on cliques, clans, clubs, 
k-plexes, k-cores and lambda sets (Wasserman and Faust, 1994, chapter 8), to the more recent 
contribution of physicists and computer scientists on community analysis (Fortunato, 2010), 
network theorists have provided researchers with a wide range of measures of cohesion in social 
networks. 

However, neither the classical approaches nor new developments in community analysis are 
well enough suited to address many of the common uses of group cohesion in the sociological 
and organizational literature, for three key reasons. First, while most of these measures can 
help us identify cohesive subgroups, they do not provide insight into their robustness, which is 
a critical element of the theoretical conceptualization of cohesion. In most cases, the removal 
of only a few actors from a subgroup can lead to its fragmentation into smaller disconnected 
groups (White and Harary, 2001). Second, many cohesive subgroup measures do not allow for 
overlap among subgroups. Finally, even when they do allow for overlap, most measures cannot 
capture the hierarchical nature of nested social groups, where subgroups, like Russian dolls, are 
recursively nested in one another. As a result, hardly any of the existing measures capture the 
theoretical complexity of cohesion, and thus they fall short of offering useful operationalizations 
for many empirical phenomena of sociological interest. 

One model which provides a more fertile ground for sociological analysis is the structural 
cohesion model (White and Harary, 2001; Moody and White, 2003). This model is grounded 
on two common conceptualizations of group cohesion in the literature. A social group is 
considered cohesive to the extent that: a) it is resistant to being pulled apart by the removal of some 
of its members; and b) pairs of its members have multiple direct or indirect connections that pull 
it together (White and Harary, 2001, 309-310). Building on the concept of node connectivity 
from graph theory, the structural cohesion of a group is defined in this model as the minimal 
number of actors who need to be removed from the group to disconnect it. Despite its solid and 
elegant mathematical foundation, the structural cohesion model has not been widely used in 
empirical analysis because it is not possible to perform the required computations for networks 
with more than a few thousand nodes and edges in a reasonable time frame. 

These computational challenges also hindered the development of an interesting feature of 
the structural cohesion model: its applicability to both bipartite and unipartite networks. While 
many social networks are essentially bipartite in nature (as people meet, interact, and collaborate 
around specific events and/or objects), most of our analytical toolkit was developed to analyze 
one-mode networks (Latapy, Magnien, and Vecchio, 2008). Therefore it was common practice 
to conduct network analysis on one-mode projections only, but it is now clear that this practice 
leads to biased estimates of key measures, as recent work on the clustering coefficient has 
amply shown (Robins and Alexander, 2004; Lind et al., 2005; Latapy et al., 2008). The structural 
cohesion model, instead, can be applied without modification to both bipartite and unipartite 
networks (White, Owen-Smith, Moody, and Powell, 2004). That said, the original algorithm is 
prohibitively time-consuming to compute, especially with the exponential growth in the size of 
available network data. 

In this paper we extend the structural cohesion model by using the concept of average node 
connectivity, that is, the average number of actors who need to be removed from the group to 
disconnect an arbitrary pair of actors in the group. We present a set of heuristics to 
compute structural cohesion based on the fast approximation to compute pairwise node independent 
paths (White and Newman, 2001). We implemented them in NetworkX (Hagberg et al., 2008), a 
Python library for complex network analysis. The heuristics presented here allow us to 
compute the approximate value of group cohesion for moderately large networks, along with all 
the hierarchical structures of connectivity levels, one order of magnitude faster than the 
implementations which are currently available. We also suggest a novel graphical representation of the 
results of the analysis that might help summarize the results and spot differences 
across different networks (Moody, McFarland, and Bender-deMoll, 2005). 
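
The contrast between node connectivity and average node connectivity introduced above can be illustrated with a minimal sketch in NetworkX. The toy graph (two 4-cliques sharing a single actor) is an assumption made for illustration and is not one of the networks analyzed in this paper; the calls shown are the exact NetworkX routines, not the heuristics proposed here.

```python
import networkx as nx

# Hypothetical toy network: two 4-cliques, {0,1,2,3} and {3,4,5,6},
# that share the single actor 3.
G = nx.complete_graph(4)
G.add_edges_from((u, v) for u in range(3, 7) for v in range(u + 1, 7))

# Worst case: the fewest actors whose removal disconnects the group.
kappa = nx.node_connectivity(G)

# Typical case: local node connectivity averaged over all pairs of actors.
avg_kappa = nx.average_node_connectivity(G)

print(f"node connectivity: {kappa}")
print(f"average node connectivity: {avg_kappa:.3f}")
```

In this sketch the single shared actor fixes the node connectivity at its minimum, while the average stays higher because most pairs of actors sit inside the same clique and are joined by several independent paths.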

We used our implementation of the heuristics proposed in this paper to analyze three large 
collaboration networks: the co-maintenance network of Debian packages, and the co-authorship 
networks in Nuclear Theory and High-Energy Theory. We ran our analysis on both one-mode 
and two-mode networks, and compared the networks in terms of their connectivity structure. 
Consistent with the literature on two-mode networks, we show that the complex hierarchy of 
collaboration captured in the two-mode analysis is a better representation of the connectivity 
structure of empirical networks than their one-mode counterparts. 

The rest of the paper is organized as follows: we start by laying out the notation we use in 
the rest of the paper. Then we discuss the main features which a cohesive subgroup 
formalization should have from a sociological perspective, reviewing the most important formalizations 
of cohesive subgroups in the social network literature and discussing in depth the structural 
cohesion model. We then describe the exact algorithm proposed by Moody and White (2003) 
to compute the connectivity hierarchy of a given network. After that, we introduce our 
proposed heuristics, and describe their implementation and performance. We go on to report our 
findings from applying the structural cohesion analysis to three large collaboration networks, as 
well as proposing a novel graphical representation of the connectivity structure using a 
three-dimensional scatter plot. Finally we conclude with implications for future research. 


1 Terminology and notation 

An undirected graph G = (V, E) consists of a set V(G) of n nodes and a set E(G) of m edges, 
each one linking a pair of nodes. The order of G is its number of nodes n and the size of G 
is its number of edges m. Two nodes are adjacent if there is an edge that links them, and this 
edge is said to be incident with the two nodes it links. A subgraph of G is a graph whose nodes 
and edges are all in G. An induced subgraph G[U] is a subgraph defined by a subset of nodes 
U ⊆ V(G) with all the edges in G that link nodes in U. A subgraph is maximal with respect to 
some property if the addition of more nodes to the subgraph will cause the loss of that property. 

A path is an alternating sequence of distinct nodes and edges in which each edge is incident 
with its preceding and following nodes. The length of a path is the number of edges it contains. 
The shortest path between two nodes is a path with the minimum number of edges. The distance 
between any two nodes u and v of G, denoted d_G(u, v), is the length of the shortest path between 
them. The diameter of a graph G, denoted diam(G), is the length of the longest shortest path 
between any pair of nodes of G. Node independent paths are paths between two nodes that 
share no nodes in common other than their starting and ending nodes. A graph is connected 
if every pair of nodes is joined by at least one path. A component of a graph G is a maximal 
connected subgraph, which means that there is at least one path between any two nodes in that 
subgraph. 

The density of a graph G, denoted ρ(G), measures how many edges are in set E(G) 
compared to the maximum possible number of edges among nodes in V(G). Thus, density is 
calculated as ρ(G) = 2m / (n(n − 1)). A complete graph is a graph in which all possible edges are 
present, so its density is 1. A clique is an induced subgraph G[U] formed by a subset of nodes 
U ⊆ V(G) if, and only if, the induced subgraph G[U] is a complete graph. Thus, there is an edge that 
links each pair of nodes in a clique. The degree of a node v, denoted deg(v), is the number 
of edges that are incident with v. The minimum degree of a graph G is denoted δ(G) and it is 
the smallest degree of a node in G. A k-core of G is a maximal subgraph in which all nodes 
have degree greater than or equal to k, which means that a k-core is a maximal subgraph with the 
property δ ≥ k. The core number of a node is the largest value k of a k-core containing that 
node. 
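
The definitions of density, minimum degree, k-core and core number given above can be checked on a small example. The following is a minimal sketch using NetworkX; the toy graph (a 4-clique with a short pendant path) is an assumption chosen only to make the quantities easy to verify by hand.

```python
import networkx as nx

# Hypothetical toy graph: a 4-clique {0,1,2,3} with a pendant path 3-4-5.
G = nx.complete_graph(4)
G.add_edges_from([(3, 4), (4, 5)])

n, m = G.number_of_nodes(), G.number_of_edges()

# Density as defined in the text: edges present over edges possible.
density = 2 * m / (n * (n - 1))

# Minimum degree: the smallest degree of any node in G.
min_degree = min(d for _, d in G.degree())

# Core number of each node: the largest k such that the node is in a k-core.
core_number = nx.core_number(G)

# The 3-core: the maximal subgraph in which every node has degree >= 3.
three_core = nx.k_core(G, k=3)

print(density, min_degree, core_number, sorted(three_core.nodes))
```

Only the clique survives in the 3-core of this sketch: the pendant nodes are peeled away one after the other because each is left with too few neighbors once the previous one is removed.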

The removal of a node v from G results in a subgraph G − v that does not contain v nor 
any of its incident edges. The node connectivity of a graph G is denoted κ(G) and is defined as 
the minimum number of nodes that must be removed in order to disconnect the graph G. Those 
nodes that must be removed to disconnect G form a node cut-set. If it is only necessary to 
remove one node to disconnect G, this node is called an articulation point. We can also define 
the local node connectivity for two nodes u and v, denoted κ_G(u, v), as the minimum number of 
nodes that must be removed in order to destroy all paths that join u and v in G. Then the node 
connectivity of G is equal to min{κ_G(u, v) : u, v ∈ V(G)}. Similarly, the edge connectivity 
of a graph G is denoted λ(G) and is defined as the minimum number of edges that must be 
removed in order to disconnect the graph G. The edges that must be removed to disconnect G 
form an edge cut-set. 
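
As an illustration of these connectivity measures, the following minimal sketch computes them with NetworkX on a toy graph. The graph (two 4-cliques sharing one node) is an assumption made for the example, and local node connectivity is evaluated for a nonadjacent pair of nodes.

```python
import networkx as nx

# Hypothetical toy graph: two 4-cliques, {0,1,2,3} and {3,4,5,6},
# sharing node 3.
G = nx.complete_graph(4)
G.add_edges_from((u, v) for u in range(3, 7) for v in range(u + 1, 7))

kappa = nx.node_connectivity(G)            # node connectivity of G
node_cut = nx.minimum_node_cut(G)          # a minimum node cut-set
articulation = set(nx.articulation_points(G))
lam = nx.edge_connectivity(G)              # edge connectivity of G

# Local node connectivity for the nonadjacent pair (0, 4).
kappa_0_4 = nx.node_connectivity(G, s=0, t=4)

print(kappa, node_cut, articulation, lam, kappa_0_4)
```

Node 3 is the only articulation point in this sketch, so it forms a node cut-set of size one, whereas disconnecting the graph by deleting edges alone requires removing three of them.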

The measures discussed above are defined as properties of whole graphs but they can also 
be applied to subgraphs. A k-component is a maximal subgraph of a graph G that has, at least, 
node connectivity k: we need to remove at least k nodes to break it into more components. 
The component number of a node is the largest value k of a k-component containing that node. 
Notice that k-components have an inherent hierarchical structure because they are nested in 
terms of connectivity: a connected graph can contain several 2-components, each of which can 
contain one or more tricomponents, and so forth. 
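
The nesting of k-components described above can be made concrete with a small sketch. It relies on the exact `k_components` routine shipped with NetworkX; the toy graph (two 4-cliques joined by two linking edges) and the helper dictionary `component_number` are assumptions introduced only for this illustration.

```python
import networkx as nx

# Hypothetical toy graph: two 4-cliques, {0,1,2,3} and {4,5,6,7},
# joined by the two linking edges (3, 4) and (2, 5).
G = nx.complete_graph(4)
G.add_edges_from((u, v) for u in range(4, 8) for v in range(u + 1, 8))
G.add_edges_from([(3, 4), (2, 5)])

# Maps each connectivity level k to a list of node sets (the k-components).
k_comps = nx.k_components(G)

# Component number: the largest k of a k-component containing each node.
component_number = {
    v: max(k for k, comps in k_comps.items() if any(v in c for c in comps))
    for v in G
}

for k in sorted(k_comps):
    print(k, [sorted(c) for c in k_comps[k]])
```

In this sketch the whole graph forms a single 2-component, and each clique is a tricomponent nested inside it, which is the Russian-doll structure that the hierarchy of k-components is meant to capture.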


2 Cohesion in social networks 

Doreian and Fararo (1998) argue that group cohesion can be divided analytically into an ideational 
component, which is based on the members’ identification with a collectivity, and a relational 
component, which is based on connections among members. These connections are, at least in 
part, observable, and thus the relational approach seems more appropriate for theory building 
and empirical research. But, despite its attractiveness, the relational component has received 
much less attention than the ideational component in sociological literature. Social network 
analysis has been the exception, and since the beginning, its proponents formalized group 
cohesion in relational terms, that is, they defined the boundaries of subgroups in a community 
starting from the patterns of relations among actors. 

Unfortunately most of the existing formalizations of cohesive subgroups do not capture 
some key properties of the theoretical concept of cohesive groups. First, a cohesive subgroup 
should be robust, in the sense that its qualification as a group should not be dependent on the 
actions of a single individual, or any small set of individuals that belong to the group. This 
implies, on the one hand, that no actor, or small set of actors, should be able to dissolve the 
cohesive subgroup by abandoning it; while, on the other hand, all actors in a group should be 
related to all other actors by multiple direct or indirect connections in order to pull it together 
(White and Harary, 2001; Moody and White, 2003). Therefore, cohesive subgroups should also 
be relatively invariant to changes outside the group (Brandes and Erlebach, 2005, chapter 6). 

Second, actual social groups tend to overlap in the sense that some actors are likely to be 
part of more than one cohesive subgroup. As Freeman (1992) notes, formalizations of 
subgroups that overlap a lot are not well suited to capturing the theoretical concept of groups 
because their sociological use is not focused on individuals but on contexts, such as 
productive relations, friendship relations, or family ties, to name a few. Thus if groups are defined 
around a highly specific context the overlap is likely to be small. Therefore the formalization of 
subgroups has often assumed non-overlapping subgroups. Moreover, non-overlapping subgroups 
can be used to develop categorical variables for membership that could be used in regression 
analysis (Borgatti et al., 1990). However, there is always overlap among cohesive subgroups in 
actual social groups, and this overlap might be both empirically and theoretically relevant. 

Third, following a typical distinction in the social network literature, cohesive groups have 
both a structural and a positional dimension. In the former, cohesive subgroups are defined in 
terms of the global patterns of relations, and the focus is on the groups and the network as a 
whole. In the latter, the focus is on the identification of actors who, because of their network 
position, obtain preferential access to information or resources that flow through the network. 
Cohesive subgroup formalizations should help address both structural and positional questions. 

Last but by no means least, cohesive subgroups are likely to display a hierarchical structure 
in the sense that highly cohesive subgroups are nested inside less cohesive ones. This notion 
of hierarchy is grounded on Simon’s definition: “a system that is composed of interrelated 
subsystems, each of the latter being, in turn, hierarchic in structure until we reach some lowest 
level of elementary subsystem” (Simon, 1962, 468). A hierarchical conception of cohesive 
subgroups implies that there is a relevant organization at all scales of the network, and that 
cohesive groups are a meso-level structure that is not reducible to either macro-level or micro-level 
phenomena and dynamics. This nested conception of cohesive subgroups provides a direct link 
with the structural dimension of the sociological concept of embeddedness (Granovetter, 1985). 
The nested nature of cohesive groups allows one to operationalize social relations that are, in 
direct contrast to arm's-length relations, structurally embedded in a social network. 

In the following section we briefly review existing social network formalizations of 
subgroup cohesion. For each method, table 1 provides the definition, the underlying logic, and the 
measure proposed, and evaluates the method in terms of the four criteria just described. We will 
therefore consider whether the formalizations are robust, allow for overlapping groups, provide 
information on both the structure and the position of nodes, and capture the hierarchical structure 
of the groups. 

2.1 Formalizations of cohesive subgroups 

Historically, the first social network approaches to formalizing subgroup cohesion identified 
cohesive subgroups by considering only internal ties among the actors in the group. 
However, most recent formalizations define cohesive subgroups by considering both internal ties 
among their members and external ties between each subgroup and the rest of the network 
(Wasserman and Faust, 1994). All the formalizations based on internal ties build on the 
concept of clique, which was later generalized by relaxing some of the strict conditions of 
distance, degree or density that the clique concept imposes. The formalizations that consider both 
internal and external ties can be organized in two main categories depending on whether they 
use density or connectivity to measure internal and external ties. 

The first formalization of cohesive subgroups was the concept of clique (Luce and Perry, 
1949), which is a maximal subset of actors in which each actor is directly connected to every 
other actor in the subgroup. For small groups in some contexts, such as friendship networks, it 
makes sense to use the clique concept. However, in many contexts, especially in large and/or 
very sparse networks, it is unlikely that the existing cohesive subgroups will be formed by actors 
that have direct relations with all other actors in the subgroup. Cliques, however, intuitively 
capture the idea that a cohesive subgroup exists independently of the action of any individual 
in the group. Thus the group is robust because it cannot be disconnected by removing any 
individual actor. Cliques can overlap (and they usually do so a lot), but they do not display a 
hierarchical organization. Because of the limitations of the clique concept, some generalizations 
were developed; on the one hand, there emerged a family of generalizations based on relaxing 
distances among members of the subgroup: n-cliques, n-clans, and n-clubs (Mokken, 1979); 
and, on the other, generalizations based on relaxing the number of links between members of 
the subgroup: k-plexes (Seidman and Foster, 1978) and k-cores (Seidman, 1983b). 

All these generalizations except for the k-core are quite arbitrary because the analyst has to 
set the parameters n or k depending on the concrete aim of the analysis at hand and its 
empirical setting. The k-core is also the only generalization of the clique concept with an inherent 
hierarchical structure: 3-cores are always nested inside 2-cores, 4-cores inside 3-cores, and 
so forth. Thus, this formalization captures an important aspect of the sociological concept of 
cohesive groups. However, k-cores are not robust because the removal of a few actors could 
potentially disconnect them; in fact, a k-core does not even need to be connected 
(White and Harary, 2001). Furthermore, the definition of a k-core only considers internal 
relations among actors within it, without considering relations with the rest of the network. 

Another important subset of subgroup formalizations identifies cohesive subgroups by 
comparing the internal and external ties of subgroup members. The two key criteria to define groups 
in these categories are density and connectivity. The first formalization of this kind was the LS 
set (Luccio and Sami, 1969; Lawler, 1973): a set of nodes in which each of its proper subsets 
has more ties with the nodes outside that subset than the LS set itself. The main idea is that 
an LS set is a union of subsets of nodes. This union is better than any subset in terms of 
cohesion because it has fewer connections to the outside. Thus, actors in the LS set have more 
connections to other members than to outsiders. LS sets are robust to the removal of edges and 
they have an inherent hierarchical structure; however, due to their strict requirements, only very 
few LS sets are actually found in empirical social networks. Lambda sets (Borgatti et al., 1990) 
were introduced as a generalization of LS sets designed to capture only the edge-connectivity 
properties of the LS sets. Lambda sets are maximal subsets of nodes that have more edge 
independent paths between them than with nodes outside the subset. This generalization, however, 
does not capture important features of the sociological concept of group cohesiveness. On the 
one hand, lambda sets are not robust to the removal of nodes, and, on the other hand, the edge 
independent paths that link the members of a lambda set can go through nodes that are not in the 
lambda set; thus there is no strict separation between the role of actors inside and outside a 
lambda set with respect to its internal cohesion. 
| Based on | Criteria | Measure | Definition | Robust | Overlap | Positional | Hierarchical | Computational |
|---|---|---|---|---|---|---|---|---|
| Absolute: only internal | | | | | | | | |
| complete connectivity | diam = ρ = 1; δ = λ = κ = n − 1 | clique | maximal subgraph of nodes all of which are adjacent to each other | Yes | Yes: clique percolation | Yes: structural folds | Yes: k-cliques | Slow |
| relax distance | max{d_G(u, v)} ≤ n | n-clique | maximal subgraph in which the largest geodesic distance is no greater than n | No | No | No | No | Slow |
| relax distance | n-clique with diam ≤ n | n-clan | n-clique that also has a diameter no greater than n | No | Yes | No | No | Slow |
| relax distance | diam = n | n-club | a maximal subgraph of diameter n | No | Yes | No | No | Slow |
| relax degree | δ ≥ n − k | k-plex | maximal subgraph in which each node may be lacking ties to no more than k other nodes | No | Yes | No | No | Slow |
| relax degree | δ ≥ k | k-core | maximal subgraph in which all nodes have degree k or more | No | No | No | Yes | Very fast O(m) |
| relax density | ρ ≥ η | η-dense subgraph | subgraph with density greater than or equal to η, where 0 ≤ η ≤ 1 | No | No | No | No | Slow |
| Relative: Internal (+) External (-) | | | | | | | | |
| density | minimize edges to outside | LS sets | set of nodes in which each of its proper subsets has more ties with the nodes outside that subset than the LS set itself | Yes | No | Yes | Yes | Slow O(n^4) |
| density | quality function of partitions | modularity | the fraction of the edges that fall within the given groups minus the expected such fraction if edges were distributed at random | No | No | No | No | Optimum: Slow; Approx: Fast |
| connectivity | conductance | random walk based partition algorithms | weight of edge cut-sets among different subgroups | No | No | No | No | Fast |
| connectivity | edge-connectivity | lambda sets | maximal subset of nodes that have more edge independent paths between them than with nodes outside the subset | Not as robust as LS sets | No | No | Yes | Slow O(n^4) |
| connectivity | node-connectivity | k-components | maximal subgraph that has, at least, node connectivity k: we need to remove at least k nodes to break it into more components | Yes | Yes: k − 1 nodes | Yes | Yes | Exact: Slow O(n^4); Approx: ≪ O(n^4) |

Table 1: Summary of cohesive subgroup formalizations from the social network analysis literature (Luce and Perry, 1949; Luccio and Sami, 1969; 
Lawler, 1973; Seidman and Foster, 1978; Mokken, 1979; Seidman, 1983b,a; Borgatti et al., 1990; Wasserman and Faust, 1994; White and Harary, 2001; 
Moody and White, 2003; Brandes and Erlebach, 2005; Fortunato, 2010). Notation: diam is diameter, ρ is density, δ is minimum degree, λ is edge 
connectivity, κ is node connectivity, n is the number of nodes, m is the number of edges, and d_G(u, v) is the distance between nodes u and v in G. 

More recently, under the label community analysis, an interdisciplinary community of 
researchers interested in complex networks has proposed a novel family of subgroup measures 
and algorithms (Fortunato, 2010). Essentially their approach is to divide a network into 
subgroups by grouping nodes that are more densely connected among themselves than with the rest of 
the network. To objectively define how good a concrete partition of a network is, they define 
a quality function (Brandes and Erlebach, 2005; Fortunato, 2010). There are many different 
quality functions used in network literature, with most of them based on density, but also a few 
based on connectivity. The most popular quality function is modularity, which is computed as 
the fraction of the edges that fall within the given groups minus the expected value of the 
fraction if edges were distributed at random. However, the subgroups resulting from community 
analysis techniques are not hierarchically organized in the sociological sense discussed above 
because there is no natural nestedness among groups.¹ 
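
The verbal definition of modularity given above can be turned into a short worked computation. This is a minimal sketch that assumes the usual configuration-model baseline for the "expected fraction" (the text only says that edges are distributed at random); the toy graph and the partition `groups` are likewise assumptions for illustration.

```python
import networkx as nx
from networkx.algorithms.community import modularity

# Hypothetical toy graph: two triangles joined by the single edge (2, 3).
G = nx.Graph([(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)])
groups = [{0, 1, 2}, {3, 4, 5}]

m = G.number_of_edges()
q_by_hand = 0.0
for group in groups:
    # Fraction of edges that fall inside the group.
    internal = G.subgraph(group).number_of_edges()
    # Expected fraction under the assumed null model, which preserves degrees.
    degree_sum = sum(d for _, d in G.degree(group))
    q_by_hand += internal / m - (degree_sum / (2 * m)) ** 2

# Same quantity computed by NetworkX for comparison.
q_networkx = modularity(G, groups)

print(q_by_hand, q_networkx)
```

Each triangle contributes 3/7 − (7/14)², so the partition into the two triangles has modularity 5/14 in this sketch.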

The first wave of community analysis focused on the analysis of non-overlapping groups, but 
recent developments have explored overlapping community structures. The most interesting 
approach of this kind is the clique percolation method (Palla, Derenyi, Farkas, and Vicsek, 2005) 
and its generalizations based on short cycles connectivity (Batagelj and Zaversnik, 2007). A 
k-clique is a complete subgraph formed by k members. Two k-cliques are considered adjacent 
if they share k − 1 actors. A k-clique community is the largest connected subgraph obtained 
by the union of all adjacent k-cliques. k-clique communities can share nodes, so overlapping is 
possible. The clique percolation approach has proven to be a fertile ground over which to build 
theoretical developments on the positional dimension of cohesion. The concept of intercohesion 
based on the structural fold network topology (Vedres and Stark, 2010) is the most prominent 
example. Actors at structural folds are insiders in multiple cohesive subgroups (k-clique 
communities). Thus they have access to diverse resources and information from each subgroup 
without being isolated and limited to only one group of neighbors. Vedres and Stark show that 
this distinctive structural position helps to explain innovation and entrepreneurial dynamics in 
the context of firm networks. 

However, these new developments in community analysis are not well suited to address 
many of the common uses of group cohesion in the sociological literature. The clique 
percolation method assumes that the network under analysis has a large number of cliques, so it may 
fail to deliver meaningful results for networks with few cliques; also, if there are too many 
cliques, it may yield trivial results, such as considering the whole network a cohesive group 
without internal divisions. Moreover, this method is focused on finding subgraphs that contain 
many k-cliques inside, which is not exactly the same as subgraphs more densely connected 
internally than externally, because a k-clique community could be formed by chains of k-cliques 
with low edge density among non-adjacent k-cliques. This implies that k-clique communities 
are not necessarily robust to node removal. 
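
The chain-of-cliques situation described above can be reproduced in a few lines. The sketch below uses the `k_clique_communities` routine from NetworkX on a hypothetical strip of triangles; the graph and the variable names are assumptions for illustration only.

```python
import networkx as nx
from networkx.algorithms.community import k_clique_communities

# Hypothetical toy network: a strip of triangles (i, i+1, i+2) for i = 0..5.
# Consecutive triangles share two actors, so they are adjacent 3-cliques.
G = nx.Graph()
for i in range(6):
    G.add_edges_from([(i, i + 1), (i + 1, i + 2), (i, i + 2)])

communities = [set(c) for c in k_clique_communities(G, 3)]
community = G.subgraph(communities[0])

kappa = nx.node_connectivity(community)   # actors needed to disconnect it
rho = nx.density(community)               # internal edge density

print(communities, kappa, rho)
```

All eight actors end up in a single 3-clique community, yet removing two well-chosen actors is enough to disconnect it, and lengthening the chain would enlarge the community without raising its node connectivity.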

2.2 The structural cohesion model 

The structural cohesion approach to subgroup cohesion (White and Harary, 2001; Moody and White, 
2003) is grounded on two mathematically equivalent definitions of cohesion that are based on 
commonly used concepts of cohesion in the sociological literature. On the one hand, cohesion is 
the ability of a collectivity to hold together independently of the will of any individual. As set 
out by the formal definition, “a group’s structural cohesion is equal to the minimum number of 
actors who, if removed from the group, would disconnect the group” (Moody and White, 2003, 109). 
On the other hand, a cohesive group has multiple independent relational paths among all pairs of 
members. According to the formal definition “a group’s structural cohesion is equal to the 
minimum number of independent paths linking each pair of actors in the group” (Moody and White, 
2003, 109). These two definitions are mathematically equivalent in terms of the graph theoretic 
concept of connectivity as defined by Menger’s Theorem (White and Harary, 2001, 330), which 
can be formulated locally: “The minimum node cut set κ(u, v) separating a nonadjacent u, v 
pair of nodes equals the maximum number of node-independent u − v paths”; and globally: “A 
graph is k-connected if and only if any pair of nodes u, v is joined by at least k node-independent 
u − v paths”. Thus Menger’s theorem links, through an equivalence relation, a structural property 
of graphs (connectivity based on cut sets) with how graphs are traversed (the number of 
node independent paths among pairs of different nodes). This equivalence relation has a deep 
sociological meaning because it allows for the definition of structural cohesion in terms of the 
difficulty of pulling a group apart by removing actors and, at the same time, in terms of multiple 
relations between actors that keep a group together. 

¹ However, some of those methods are called hierarchical because they use hierarchical clustering to organize 
partitions in each step of the partition algorithm, which is commonly represented by a dendrogram. Thus, 
researchers need to introduce an arbitrary criterion to identify relevant partitions, that is, the level at which 
the dendrogram is cut. 
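
The local form of Menger's theorem quoted in this section can be checked numerically on a small graph. The following minimal sketch uses NetworkX and the Petersen graph, chosen here only as a convenient assumption; `u` and `v` are a nonadjacent pair of nodes.

```python
import networkx as nx

# The Petersen graph is used purely as a convenient example.
G = nx.petersen_graph()
u, v = 0, 2   # a nonadjacent pair of nodes in this graph

# Cut-set view: the fewest nodes whose removal separates u from v.
cut = nx.minimum_node_cut(G, u, v)

# Path view: a maximum set of node-independent u-v paths.
paths = list(nx.node_disjoint_paths(G, u, v))

# Local node connectivity computed directly.
kappa_uv = nx.node_connectivity(G, s=u, t=v)

print(len(cut), len(paths), kappa_uv)
```

The size of the minimum cut-set and the number of node-independent paths coincide for this pair, which is the equivalence between resistance to actor removal and multiplicity of relations that the model relies on.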

The starting point of cohesion in a social group is a state where every actor can reach every 
other actor through at least one relational path. The emergence of a giant component —a large 
set of nodes in a network that have at least one path that links any two nodes— is a minimal 
condition for the development of group cohesion and social solidarity. Moody and White (2003) 
argue that, in this situation, the removal of only one node can affect the flow of knowledge, 
information and resources in a network because there is only one single path that links some 
parts of the network. Thus, if a network has actors who are articulation points, their role in 
keeping the network together is critical; and by extension the network can be disconnected by 
removing them. Moody and White (2003) convincingly argue that biconnectivity provides a 
baseline threshold for strong structural cohesion in a network because its cohesion does not 
depend on the presence of any individual actor and the flow of information or resources does 
not need to pass through a single point to reach any part of the network. Therefore, the concept 
of robustness is at the core of the structural cohesion approach to subgroup cohesion. 

Note that the bicomponent structure of a graph is an exact partition of its edges, which means
that each edge belongs to one, and only one, bicomponent; but this is not the case for nodes
because k-components can overlap in k − 1 nodes. In the case of bicomponents, articulation
points belong to all bicomponents that they separate. Thus, this formalization of subgroup
cohesion allows limited horizontal overlapping over k-components of the same k. On the other
hand, the k-component structure of a network is inherently hierarchical because k-components
are nested in terms of connectivity: a connected graph can contain several 2-components, each
of which can contain one or more tricomponents, and so forth. This is one of the bases over
which the structural cohesion model is built and it is especially useful for operationalizing the
hierarchical conception of nested social groups.
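
The partition of edges and the overlap of nodes can be illustrated with a short sketch, again assuming NetworkX and a hypothetical toy graph made of two triangles sharing one node:

```python
import networkx as nx

# Toy graph (hypothetical): two triangles that share node 2.
G = nx.Graph([(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (2, 4)])

# Each bicomponent as a set of undirected edges and as a set of nodes.
edge_blocks = [{frozenset(e) for e in block}
               for block in nx.biconnected_component_edges(G)]
node_blocks = [set(block) for block in nx.biconnected_components(G)]

# Edges are partitioned: the blocks cover every edge, and their sizes
# add up to the total number of edges (no edge is counted twice).
all_edges = {frozenset(e) for e in G.edges()}
edges_partitioned = (
    set().union(*edge_blocks) == all_edges
    and sum(len(block) for block in edge_blocks) == len(all_edges)
)

# Nodes are not partitioned: the articulation point belongs to both
# blocks, an overlap of k - 1 = 1 node for k = 2.
shared_nodes = set.intersection(*node_blocks)
```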

However, one shortcoming of classifying cohesive subgroups only in terms of node connectivity
is that k-components of the same k are always considered equally cohesive despite the
fact that one of them might be very close to the next connectivity level, while the other might
barely qualify as a component of level k (i.e. removing a few edges could reduce the connectivity
level to k − 1). White and Harary (2001) propose to complement node connectivity with the
measure of conditional density. If a subgroup has node connectivity k, then its internal density
can only vary within a limited range if the subgroup maintains that same level of connectivity.
Thus, they propose to combine node connectivity and conditional density to have a continuous
measure of cohesion. But connectivity is a better measure than density for measuring cohesion
because there is no guarantee that a denser subgroup is more robust to node removal than a
sparser one, given that both have the same node connectivity k.

Building on this insight, we propose using another connectivity-based metric to obtain a
continuous and more granular measure of cohesion: the average node connectivity. Node connectivity
is a measure based on a worst-case scenario in the sense that to actually break apart
a k-connected graph by only removing k nodes we have to carefully choose which nodes to
remove. Recent work on network robustness and reliability (Albert, Jeong, and Barabasi, 2000;
Dodds, Watts, and Sabel, 2003) uses as the main benchmark for robustness the tolerance to the
random or targeted removal of nodes by degree; it is unlikely that by using either of these attack
tactics we could disconnect a k-connected graph by only removing k nodes. Thus node connectivity
does not reflect the typical impact of removing nodes in the global connectivity of a graph
G. Beineke, Oellermann, and Pippert (2002) propose the measure of average node connectivity
of G, denoted $\bar{\kappa}(G)$, defined as the sum of local node connectivity between all pairs of
different nodes of G divided by the number of distinct pairs of nodes. Or put more formally:


$$\bar{\kappa}(G) = \frac{\sum_{u,v} \kappa(u,v)}{\binom{n}{2}} \qquad (1)$$


Where n is the number of nodes of G. In contrast to node connectivity $\kappa(G)$, which is the
minimum number of nodes whose removal disconnects some pairs of nodes, the average connectivity
$\bar{\kappa}(G)$ is the expected minimal number of nodes that must be removed in order to disconnect
an arbitrary pair of nodes of G. For any graph G it holds that $\bar{\kappa}(G) \geq \kappa(G)$. As
Beineke et al. show, average connectivity does not increase only with the increase in the number
of edges: graphs with the same number of nodes and edges, and the same degree for each node can
have different average connectivity (Beineke et al., 2002, figure 2, 33). Thus, this continuous
measure of cohesion doesn’t have the shortcomings of conditional density to measure the robustness
of the cohesive subgroups.
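
A minimal sketch of the definition of average node connectivity, assuming NetworkX for the local connectivity computation (the helper name and the toy graph are illustrative and not part of the paper's code):

```python
from itertools import combinations

import networkx as nx


def average_node_connectivity(G):
    """Mean local node connectivity over all unordered pairs of nodes."""
    pairs = list(combinations(G.nodes(), 2))
    total = sum(nx.node_connectivity(G, s=u, t=v) for u, v in pairs)
    return total / len(pairs)


# Toy graph (hypothetical): two triangles that share node 2.
G = nx.Graph([(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (2, 4)])

kappa = nx.node_connectivity(G)            # worst case: one node suffices
kappa_bar = average_node_connectivity(G)   # average over the 10 pairs
library_value = nx.average_node_connectivity(G)
```

In this toy graph the six pairs inside a triangle have two node independent paths and the four pairs that straddle the shared node have one, so the average is 16/10 = 1.6 while the node connectivity is 1.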

The relation between node connectivity and average node connectivity is analogous to the
relation between diameter and average distance. The diameter of a graph G is the maximum
distance between any two nodes of G, and like node connectivity, it is a worst-case scenario.
It does not reflect the typical distance that separates most pairs of nodes in G. When modeling
distances between actors in networks, it is better to use the average path length (L) because it is
close to the typical case: if we choose at random two nodes from a network, it is more likely that
their distance is closer to the average than to the maximum distance. Taking into account the
average connectivity of each one of the k-components of a network allows a more fine-grained
conception of structural cohesion because, in addition to considering the minimum number of
nodes that must be removed in order to disconnect a subgroup, we also consider the number of
nodes that, on average, have to be removed to actually disconnect an arbitrary pair of nodes of
the subgroup. The latter is a better measure of subgroup robustness than the departure of key
individuals from the network.

Structural cohesion is a powerful explanatory factor for a wide variety of interesting empiri¬ 
cal social phenomena. It can be used to explain, for instance: the likelihood of building alliances 
and partnerships among biotech firms (Powell et al., 2005); how positions in the connectivity 
structure of the Indian inter-organizational ownership network are associated with demographic 
features (age and industry); and differences in the extent to which firms engage in multiplex 
and high-value exchanges (Mani and Moody, 2014). Social cohesion can also help us under¬ 
stand degrees of school attachment and academic performance in young people, as well as the 
tendency of firms to enroll in similar political activity behaviors (Moody and White, 2003). It 
offers insight, also, into emerging trust relations among neighborhood residents or the hiring 
relations among top level US graduate programs (Grannis, 2009). In addition to social soli¬ 
darity and group cohesion, the model can equally fit many relevant theoretical issues, such as 
conceptualizing structural differences among fields and organizations (White et al., 2004), op¬ 
erationalizing the structural component of social embeddedness (Granovetter, 1985; Moody, 
2004), explaining the role of highly connected subgroups in boosting diffusion in social networks
without a high rate of decay (Moody, 2004; White and Harary, 2001), or highlighting the
complexity and diversity of the structure of real world markets beyond stylized one-dimensional
characterizations of the market (Mani and Moody, 2014).

Despite all its merits, the structural cohesion model has not been widely applied to empirical
analysis because it is not practical to compute it for networks with more than a few thousand
nodes and edges due to its computational complexity. What’s more, it is not implemented in
most popular network analysis software packages. In the next section, we will review the existing
algorithm to compute the k-component structure for a given network, before introducing
our heuristics to speed up the computation.


3 Existing algorithms for computing k-component structure

Moody and White (2003, appendix A) provide an algorithm for identifying k-components in a
network, which is based on the Kanevsky (1993) algorithm for finding all minimum-size node
cut-sets of a graph; i.e. the set (or sets) of nodes of cardinality k that, if removed, would break
the network into more connected components. The algorithm consists of 4 steps:

1. Identify the node connectivity, k, of the input graph using flow-based connectivity algorithms
(Brandes and Erlebach, 2005, chapter 7).

2. Identify all k-cutsets at the current level of connectivity using the Kanevsky (1993) algorithm.

3. Generate new graph components based on the removal of these cutsets (nodes in the cutset
belong to both sides of the induced cut).

4. If the graph is neither complete nor trivial, return to 1; otherwise end.
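
A single pass of steps 1 to 3 can be sketched as follows. This is an illustrative simplification that assumes a connected input graph and NetworkX's `node_connectivity` and `all_node_cuts` functions; it handles each cutset separately and does not implement the recursion of step 4, and the helper name is hypothetical:

```python
import networkx as nx


def split_on_cutsets(G):
    """One pass of steps 1-3; returns (k, set of node sets after the split)."""
    k = nx.node_connectivity(G)               # step 1: flow-based connectivity
    pieces = set()
    for cut in nx.all_node_cuts(G, k=k):      # step 2: all cutsets of size k
        rest = G.copy()
        rest.remove_nodes_from(cut)           # step 3: remove the cutset
        for component in nx.connected_components(rest):
            # Nodes in the cutset belong to both sides of the induced cut.
            pieces.add(frozenset(component | cut))
    return k, pieces


# Toy graph (hypothetical): two triangles that share node 2.
G = nx.Graph([(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (2, 4)])
k, pieces = split_on_cutsets(G)
```

Each resulting piece would then be fed back into step 1 until it is complete or trivial, which is where most of the computational cost of the exact approach accumulates.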

As the authors note, one of the main strengths of the structural cohesion approach is that
it is theoretically applicable to both small and large groups, which contrasts with the historical
focus of the literature on small groups when dealing with cohesion. But the fact that this concept,
and the algorithm proposed by the authors, are theoretically applicable to large groups does not
mean that this would be a practical approach for analyzing the structural cohesion of large
social networks.

^The fastest implementation of this algorithm runs in O(N^) time (Csardi and Nepusz, 2006) which is
impractical for moderately large networks.

The equivalence relation established by Menger’s theorem between node cut sets and node
independent paths can be useful to compute connectivity in practical cases but both measures
are almost equally hard to compute if we want an exact solution. However, White and Newman
(2001) proposed a fast approximation algorithm for finding good lower bounds of the number
of node independent paths between two nodes. This smart algorithm is based on the idea of
searching paths between two nodes, marking the nodes of the path as “used” and searching for
more paths that do not include nodes already marked. But instead of trying all possible paths
without order, this algorithm considers only the shortest paths: it finds node independent paths
between two nodes by computing their shortest path, marking the nodes of the path found as
“used” and then searching other shortest paths excluding the nodes marked as “used” until no
more paths exist. Because finding the shortest paths is faster than finding other kinds of paths,
this algorithm runs quite fast, but is not exact because a shortest path could use nodes that, if
the path were longer, may belong to two different node independent paths (White and Newman,
2001, section III). Therefore a condition for the use of this approximation algorithm would
be that the networks analyzed should be sparse; this will reduce its inaccuracy because it will
be less likely that a shorter path uses nodes that could belong to two or more longer node
independent paths.
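
The shortest-path idea described above can be sketched as follows. This is a simplified illustration rather than White and Newman's own code: it assumes an undirected NetworkX graph and two distinct nodes, and the function name and the example graph are hypothetical.

```python
import networkx as nx


def approx_node_independent_paths(G, s, t):
    """Lower bound on the number of node independent paths between s and t."""
    H = G.copy()
    paths = 0
    if H.has_edge(s, t):
        # A direct edge is a path with no internal nodes: count it once.
        H.remove_edge(s, t)
        paths += 1
    while True:
        try:
            path = nx.shortest_path(H, s, t)
        except nx.NetworkXNoPath:
            break
        # Mark the internal nodes of the path as "used" by removing them.
        H.remove_nodes_from(path[1:-1])
        paths += 1
    return paths


# Two node independent paths of length 4 (s-a-c-d-t and s-b-e-f-t) plus a
# shortcut a-f that creates a shorter path using one node from each.
G = nx.Graph([("s", "a"), ("a", "c"), ("c", "d"), ("d", "t"),
              ("s", "b"), ("b", "e"), ("e", "f"), ("f", "t"),
              ("a", "f")])

approx = approx_node_independent_paths(G, "s", "t")
exact = nx.node_connectivity(G, s="s", t="t")
on_cycle = approx_node_independent_paths(nx.cycle_graph(6), 0, 3)
```

In the example the shortest path s-a-f-t consumes one node from each of the two longer independent paths, so the approximation reports one path where the exact flow-based value is two. On a simple cycle, where no such shortcut exists, both paths are found.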

White and Newman suggest that this algorithm could be used to find k-components. First
one should compute the node independent paths between all pairs of different nodes of the
graph. Then build an auxiliary graph in which two nodes are linked if they have at least k
node independent paths connecting them. The induced subgraph of all nodes of each connected
component of the auxiliary graph forms an extra-cohesive block of level k (like a k-component
but with the difference that not all node independent paths run entirely inside the subgraph).
Finally, we could approximate the k-component structure of a graph by successive iterations of
this procedure.

However, there are a few problems with this approach. First, a k-component is defined as a
maximal subgraph in which all pairs of different nodes have, at least, k node independent paths
between them. If we rely on the connected components of the auxiliary graph as proposed by
White and Newman (2001) we will include in a given k-component all nodes that have at least k
node independent paths with only one other node of the subgraph. Thus, the cohesive subgraphs
detected won’t have to be k-components as defined in graph theory. Second, k-components
can overlap in k − 1 nodes. If we only consider connected components (i.e. 1-components)
in the auxiliary graph, we will not be able to distinguish overlapping k-components. Finally,
the approach proposed by White and Newman is not practical in computational terms for large
networks because of its recursive nature and because it needs to compute node independent
paths for all pairs of different nodes in the network as starting point.
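
The auxiliary-graph procedure and its overlap problem can be illustrated with a small sketch. It assumes NetworkX and, for simplicity, uses exact flow-based local connectivity in place of the approximation; the helper name and the toy graph are hypothetical.

```python
from itertools import combinations

import networkx as nx


def auxiliary_graph(G, k):
    """Link two nodes of G if they have at least k node independent paths."""
    H = nx.Graph()
    H.add_nodes_from(G)
    for u, v in combinations(G.nodes(), 2):
        if nx.node_connectivity(G, s=u, t=v) >= k:
            H.add_edge(u, v)
    return H


# Toy graph (hypothetical): two triangles that share node 2.
G = nx.Graph([(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (2, 4)])
H = auxiliary_graph(G, k=2)

# Connected components of H are the candidate blocks of level 2.
blocks = [set(c) for c in nx.connected_components(H) if len(c) > 1]

# Exact 2-components (bicomponents) of G, for comparison.
bicomponents = [set(c) for c in nx.biconnected_components(G)]
```

Because the shared node is linked in H to the members of both triangles, the connected components of H merge the two overlapping 2-components into a single block of five nodes, which is the second problem described above.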


4 Heuristics for computing k-components and their average
connectivity

The logic of the algorithm presented here is based on repeatedly applying fast algorithms for
k-cores (Batagelj and Zaversnik, 2011) and biconnected components (Tarjan, 1972) in order to
narrow down the number of pairs of different nodes over which we have to compute their local
node connectivity for building the auxiliary graph in which two nodes are linked if they have
at least k node independent paths connecting them. We follow the classical insight that, “k-cores
can be regarded as seedbeds, within which we can expect highly cohesive subsets to be
found” (Seidman, 1983b, 281). More formally, our approach is based on Whitney’s theorem
(White and Harary, 2001, 328), which states an inclusion relation among node connectivity
$\kappa(G)$, edge connectivity $\lambda(G)$ and minimum degree $\delta(G)$ for any graph G:

$$\kappa(G) \leq \lambda(G) \leq \delta(G) \qquad (2)$$

This theorem implies that every k-component is nested inside a k-edge-component, which
in turn, is contained in a k-core. This approach, unlike the proposal of White and Newman
(2001), does not require computing node independent paths for all pairs of different nodes as a
starting point, thus saving an important amount of computation. Moreover it does not require
recursively applying the same procedure over each subgraph. In our approach we only have
to compute node independent paths among pairs of different nodes in each biconnected part of
each k-core, and repeat this procedure for each k from 3 to the maximal core number of a node
in the input network.
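
The inclusion relation and the way it narrows the search space can be checked on a small example. The sketch below assumes NetworkX and uses its bundled karate club graph purely as a convenient test case; it is not one of the networks analyzed in this paper.

```python
import networkx as nx

G = nx.karate_club_graph()

# Whitney's inequality: node connectivity <= edge connectivity <= min degree.
kappa = nx.node_connectivity(G)
lam = nx.edge_connectivity(G)
delta = min(degree for _, degree in G.degree())
whitney_holds = kappa <= lam <= delta

# Search space for level k: biconnected parts of the k-core with more
# than k nodes. Pairwise connectivity is only needed inside these parts.
k = 3
core = nx.k_core(G, k)
search_space = [set(c) for c in nx.biconnected_components(core) if len(c) > k]

# Every exact k-component (Moody and White style algorithm, as implemented
# in NetworkX) must lie inside the k-core.
exact = nx.k_components(G)
nested = all(block <= set(core) for block in exact.get(k, []))
```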

The aim of the heuristics presented here is to provide a fast and reasonably accurate way
of analyzing the cohesive structure of empirical networks of thousands of nodes and edges. As
we have seen, k-components are the cornerstone of structural cohesion analysis. But they are
very expensive to compute. Our approach consists of computing extra-cohesive blocks of level
k for each biconnected component of a k-core. Extra-cohesive blocks are a relaxation of the
k-component concept in which not all node independent paths among pairs of different nodes
have to run entirely inside the subgraph. Thus, there is no guarantee that an extra-cohesive block
of level k actually has node connectivity k. We introduce an additional constraint to the
extra-cohesive block concept in order to approximate k-components: our algorithm computes
extra-cohesive blocks of level k that are also k-cores by themselves in G. Based on several tests with
synthetic and empirical networks presented below, we show that usually extra-cohesive blocks
detected by our algorithm have indeed node connectivity k. Furthermore, extra-cohesive blocks
maintain high requirements in terms of multiconnectivity and robustness, thus conserving the
most interesting properties from a sociological perspective on the structure of social groups.

Combining this logic with three observations about the auxiliary graph H allows us to design
a new algorithm for finding extra-cohesive blocks in each biconnected component of a
k-core, that can either be exact but slow (using flow-based algorithms for local node
connectivity; Brandes and Erlebach, 2005, chapter 7) or fast and approximate, giving a lower
bound with certificate of the composition and the connectivity of extra-cohesive blocks (using
the White and Newman (2001) approximation for local node connectivity). Once we have a fast
way to compute extra-cohesive blocks, we can approximate k-components by imposing that the
induced subgraph of the nodes that form an extra-cohesive block of G have to also be a k-core
in G.

Let H be the auxiliary graph in which two nodes are linked if they have at least k node
independent paths connecting them in each of the biconnected components of the core of level
k of original graph G (for k > 2). The first observation is that complete subgraphs in H (H_clique)
have a one to one correspondence with subgraphs of G in which each node is connected to every
other node in the subgraph by at least k node independent paths. Thus, we have to search for
cliques in H in order to discover extra-cohesive blocks in G.

The second observation is that an H_clique of order n is also a core of level n − 1 (all nodes
have core number n − 1), and the degree of all nodes is also n − 1. The auxiliary graph H
is usually very dense, because we build a different H for each biconnected part of the core
subgraph of level k of the input graph G. In this kind of network big clusters of almost fully
connected nodes are very common. Thus, in order to search for cliques in H we can do the
following:

1. For each core number value c_value in each biconnected component of H:

2. Build a subgraph H_candidate of H induced by the nodes that have exactly core number
c_value. Note that this is different from building a k-core, which is a subgraph induced by
all nodes with core number greater than or equal to c_value.

3. If H_candidate has order c_value + 1 then it is a clique and all nodes will have degree n − 1.
Return the clique and continue with the following candidate.

4. If this is not the case, then some nodes will have degree < n − 1. Remove all nodes with
minimum degree from H_candidate.

5. If the graph is trivial or empty, continue with the following candidate. Or otherwise
recompute the core number for each node and go to 3.
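
The clique search above can be sketched as follows. This is a simplified illustration assuming NetworkX: it tests completeness directly through node degrees instead of recomputing core numbers after each pruning round, and the function name and the example graphs are hypothetical.

```python
import networkx as nx


def cliques_from_cores(H):
    """Search for cliques in H by pruning core-number candidates."""
    found = []
    for nodes in nx.biconnected_components(H):
        B = H.subgraph(nodes)
        core = nx.core_number(B)
        # Step 1: visit each core number value, largest first.
        for c_value in sorted(set(core.values()), reverse=True):
            # Step 2: nodes whose core number is exactly c_value.
            members = {n for n in B if core[n] == c_value}
            candidate = B.subgraph(members).copy()
            # Steps 3-5: prune minimum-degree nodes until a clique remains
            # or the candidate becomes trivial or empty.
            while candidate.number_of_nodes() > 1:
                degrees = dict(candidate.degree())
                lowest = min(degrees.values())
                if lowest == len(candidate) - 1:
                    found.append(set(candidate))
                    break
                candidate.remove_nodes_from(
                    [n for n, d in degrees.items() if d == lowest]
                )
    return found


# K5 plus an extra node attached to two of its members.
H1 = nx.complete_graph(5)
H1.add_edges_from([(5, 0), (5, 1)])
cliques_1 = cliques_from_cores(H1)

# K5 with one edge missing: pruning the two low-degree nodes leaves a triangle.
H2 = nx.complete_graph(5)
H2.remove_edge(0, 1)
cliques_2 = cliques_from_cores(H2)
```

Removing all minimum-degree nodes at once is greedy: in the second example it discards both endpoints of the missing edge, although dropping only one of them would leave a clique of four nodes. This is one reason the output is an approximation.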

Finally, the third observation is that if two k-components of different order overlap, the
nodes that overlap belong to both cliques in H and will have core numbers equal to all other
nodes in the bigger clique. Thus, we can account for possible overlap when building subgraphs
H_candidate (induced by the nodes that have exactly core number c_value) by also adding to the
candidate subgraph the nodes in H that are connected to all nodes that have exactly core number
c_value. Also, if we sort the subgraphs H_candidate in reverse order (starting from the biggest), we
can skip checking for possible overlap for the biggest.

Based on these three observations, our heuristics for approximating the cohesive structure
of a network and the average connectivity of each individual block consist of:

Let G be the input graph. Compute the core number of each node in G. For each k from
3 to the maximum core number build a k-core subgraph G_k-core with all nodes in G with core
number ≥ k.

For each biconnected component of G_k-core:

1. Compute local node connectivity $\kappa(u, v)$ between all pairs of different nodes. Optionally
store the result for each pair. Either use a flow-based algorithm (exact but slow) or White
and Newman’s approximation for local node connectivity (approximate but a lot faster).

2. Build an auxiliary graph H with all nodes in this bicomponent of G_k-core with edges
between two nodes if $\kappa(u, v) \geq k$. For each biconnected component of H:

3. Compute the core number of each node in H_bicomponent, sort the values in reverse order
(biggest first), and for each value c_value:

   (a) Build a subgraph H_candidate induced by nodes with core number exactly equal to
   c_value plus nodes in H that are connected with all nodes with core number equal to
   c_value.

      i. If H_candidate has order c_value + 1 then it is a clique and all nodes will have degree
      n − 1. Build a core subgraph G_candidate of level k of G induced by all nodes in
      H_candidate that have core number ≥ k in G.

      ii. If this is not the case, then some nodes will have degree < n − 1. Remove all
      nodes with minimum degree from H_candidate. Build a core subgraph G_candidate
      of level k of G induced by the remaining nodes of H_candidate that have core
      number ≥ k in G.

         A. If the resultant graph is trivial or empty, continue with the following candidate.

         B. Else recompute the core number for each node in the new H_candidate and go
         to (i).

   (b) The nodes of each biconnected component of G_candidate are assumed to be a
   k-component of the input graph if the number of nodes is greater than k.

   (c) Compute the average connectivity of each detected k-component. Either use the
   value of $\kappa(u, v)$ computed in step 1 or recalculate $\kappa(u, v)$ in the induced subgraph
   of candidate nodes.
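
For readers who want to experiment with these steps, NetworkX ships an approximation routine, `networkx.algorithms.approximation.k_components`, whose documentation describes a procedure of this kind. The sketch below assumes that function and uses the bundled karate club graph as a hypothetical test case; it also measures the exact node connectivity of each detected block so that the approximation can be inspected.

```python
import networkx as nx
from networkx.algorithms import approximation as apxa

G = nx.karate_club_graph()

# Dictionary {k: [set of nodes, ...]} with the approximate k-components.
# min_density is the relaxation threshold applied to candidates in H.
blocks = apxa.k_components(G, min_density=0.95)

# The heuristics only explore levels up to the maximum core number.
max_core = max(nx.core_number(G).values())

# Exact node connectivity of every detected block with k >= 3, computed
# afterwards with a flow-based algorithm as a sanity check.
true_connectivity = {
    (k, i): nx.node_connectivity(G.subgraph(block))
    for k, found in blocks.items() if k >= 3
    for i, block in enumerate(found)
}
```

Comparing each key's level k with the measured connectivity shows whether a detected block is a genuine k-component or only an extra-cohesive block.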

Notice that because our approach is based on computing node independent paths between
pairs of different nodes, we are able to use these computations to calculate both the cohesive
structure and the average node connectivity of each detected k-component. Of course, computing
average connectivity comes with a cost: either more space to store $\kappa(u, v)$ in step 1, or more
computation time in step 3.c if we did not store $\kappa(u, v)$. This is not possible when applying the
exact algorithm for k-components proposed by Moody and White (2003) because it is based
on repeatedly finding k-cutsets and removing them, thus it does not consider node independent
paths at all.

The output of these heuristics is an approximation to k-components based on extra-cohesive
blocks. We find extra-cohesive blocks and not k-components because we only build the auxiliary
graph H one time on each biconnected component of a core subgraph of level k from
the input graph G. Local node connectivity is computed in a subgraph that might be larger than
the final G_candidate and thus some node independent paths that should not be counted could end
up being counted.

Accuracy can be improved by rebuilding H from the pairwise node connectivity in G_candidate
and following the remaining steps of the heuristics at the cost of slowing down the computation.
There is a trade-off between speed and accuracy. After some tests we decided to compute H
only once and lean towards the speed pole of the trade-off. Our goal is to have a usable procedure
for analyzing networks of thousands of nodes and edges in which we have substantive interests.
Following this goal, the use of the White and Newman (2001) approximation algorithm for
local node connectivity in step 3.b is key. It is almost an order of magnitude faster than the exact
flow-based algorithms. As usual, speed comes with a cost in accuracy: the White and Newman
(2001) algorithm provides a strict lower bound for the local node connectivity. Thus, by using it
we can miss an edge in H that should be there. Therefore, a node belonging to a k-component
could be excluded by the algorithm if we use the White and Newman (2001) approximation in step
3.b. This is a source of false negatives in the process of approximating the k-component structure
of a network. However, as we discussed above, the inaccuracy of this algorithm for sparse
networks is reduced because in those networks the probability that a short node independent
path uses nodes that could belong to two or more longer node independent paths is low.

Our tests reveal that the use of the White and Newman (2001) approximation does indeed
underestimate the order of some k-components, particularly in not very sparse networks. One
approach to mitigate this problem is to relax the strict cohesion requirement of H_candidate being
a clique. Following the network literature on cliques, we can relax its cohesion requirements
in terms of degree, coreness and density. We did some experiments and found that a good
relaxation criterion is to set a density threshold of 0.95 for H_candidate: it doesn’t increase false
positives and does decrease the false negatives derived from the underestimation of local node
connectivity by the White and Newman (2001) algorithm. Another possible criterion that has given
good results in our tests is permitting a variation in degree of 2 in H_candidate, that is, that the
absolute difference of the maximum and the minimum degree in H_candidate is at most 2. The
former relaxation criterion is used for all analyses presented below and in the appendix.
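
The two relaxation criteria can be written as simple predicates over a candidate subgraph. This is a minimal sketch assuming NetworkX; the function names and the example graphs are hypothetical.

```python
import networkx as nx


def passes_density(H_candidate, threshold=0.95):
    """Relaxed clique test: density of the candidate is at least threshold."""
    return nx.density(H_candidate) >= threshold


def passes_degree_spread(H_candidate, max_spread=2):
    """Relaxed clique test: maximum and minimum degree differ by at most max_spread."""
    degrees = [d for _, d in H_candidate.degree()]
    return max(degrees) - min(degrees) <= max_spread


# K10 with one edge missing has density 44/45, above the 0.95 threshold.
near_clique = nx.complete_graph(10)
near_clique.remove_edge(0, 1)

# K5 with one edge missing has density 9/10, below the threshold,
# although its degrees only range from 3 to 4.
small = nx.complete_graph(5)
small.remove_edge(0, 1)

results = {
    "near_clique_density": passes_density(near_clique),
    "small_density": passes_density(small),
    "small_degree_spread": passes_degree_spread(small),
}
```

The example shows that the two criteria are not equivalent: for small candidates a single missing edge already fails the density threshold, while the degree criterion still accepts it.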

This algorithm can be easily generalized so as to be applicable to directed networks provided
that the implementation of White and Newman’s approximation for pairwise node independent
paths supports directed paths (which is the case in our implementation of this algorithm on top
of the NetworkX library). The only changes needed then are to use strongly connected components
instead of bicomponents and, in step 3, to start with core number 2 instead of 3.

In Appendix A we present an illustration of the heuristics using a convenient small
synthetic network. In Appendix B we present an analysis of the performance of
the heuristics compared to the performance of the exact algorithm for finding k-components
(Moody and White, 2003). In Appendix C we discuss the implementation details of
the heuristics; and in Appendix D we present the Python code of our implementation
of the heuristics for illustrative purposes^.

5 Structural cohesion in collaboration networks 

The structural cohesion model can be used to explain cooperation in different kinds of
collaboration networks; for instance, coauthorship networks (Moody, 2004; White et al., 2004)
and collaboration among biotech firms (Powell et al., 2005). Most collaboration networks are
bipartite because the collaboration of individuals has as a result (or, at least, as a relevant
byproduct) some kind of object or event to which its authors are related. All these papers
follow the usual practice to deal with two-mode networks: focus the analysis only on one-mode
projections. As such, we don’t know how much information about their cohesive structure we
lose by ignoring the underlying bipartite networks. Recent literature on two-mode networks
strongly suggests that it is necessary to analyze two-mode networks directly to get an accurate
picture of their structure. For instance, in small world networks, we do know that focusing only
on projections overestimates the smallworldiness of the network (Uzzi et al., 2007). We also
know that generalizing clustering coefficients to bipartite networks can offer key information
that is lost in the projection (Robins and Alexander, 2004; Lind et al., 2005; Opsahl, 2011).
Finally, the loss of information is also critical in many other common network measures: degree
distributions, density, and assortativity (Latapy et al., 2008). We show that this is also the case
for the k-component structure of collaboration networks.

^The fully functional Python code is available from the authors.

Structural cohesion analysis based on the k-component structure of bipartite networks has
been conducted very rarely and only on very small networks (White et al., 2004). The limited
diffusion of these studies can be readily explained by the fact that bipartite networks are usually
quite a lot bigger than their one-mode counterparts, and the computational requirements, once
again, stifled empirical research in this direction. Other measures have been developed to deal
with cohesion in large bipartite networks, such as (p, q)-cores or 4-ring islands (Ahmed et al.,
2007). However, the former is a bipartite version of k-cores and thus it has the same limitations
for subgroup identification; while the latter is very useful to determine subgraphs in large
networks that are more strongly connected internally than with the rest of the network, but also
lacks some of the key elements of the definition for groups in the sociological literature, such
as being hierarchical and allowing for overlaps.


                                     Bipartite                                 Unipartite
Network                 # nodes  # edges  Av. degree  Time (s)     # nodes  # edges  Av. degree  Time (s)
Debian Lenny              13121    20220        3.08    1105.2        1383     5216        7.54     204.7
High Energy (theory)      26590    37566        2.81    3105.7        9767    19331        3.97    7136.0
Nuclear Theory            10371    15969        3.08    1205.2        4827    14488        6.00    3934.1


Table 2: Collaboration networks analyzed from science and from software development. See 
text for details on their content. Time refers to the execution of our heuristics on each network 
expressed in seconds. 


The heuristics for structural cohesion presented here allow us to compute connectivity-based
measures on large networks (up to tens of thousands of nodes and edges) quickly enough
to be able to build suitable null models. Furthermore we will be able to compare the results for 
bipartite networks with their one-mode projections. To illustrate those points we use data on 
collaboration among software developers in one organization (the Debian project) and scientists 
publishing papers in the arXiv.org electronic repository in two different scientific fields: High 
Energy Theory and Nuclear Theory. We built the Debian collaboration network by linking each 
software developer with the packages (i.e. programs) that she uploaded to the package reposi¬ 
tory of the Debian Operating System during a complete release cycle. We analyze the Debian 
Operating System version 5.0, codenamed “Lenny”, which was developed from April 8, 2007, 
to February 1, 2009. Scientific networks are built using all the papers uploaded to the arXiv.org 
preprint repository from January 1, 2006, to December 31, 2010, for two well established sci¬ 
entific fields: High Energy Physics Theory and Nuclear Theory. In these networks each author 
is linked to the papers that she has authored during the time period analyzed. One-mode projec¬ 
tions are always on the human side: scientists linked together if they have coauthored a paper, 
and developers linked together if they have worked on the same program. Table 2 presents some
details on those networks.

In the remaining part of this section we perform three kinds of analysis to demonstrate
the loss of information we incur when focusing only on one-mode projections when dealing
with bipartite networks. First, we present a tree representation of the k-component structure
(the cohesive blocks structure; White and Harary, 2001; Moody and White, 2003; White et al.,
2004; Mani and Moody, 2014) for our bipartite networks and their one-mode projections,
both for actual networks and for their random counterparts. Second, we present a comparison
among actual and random networks (both for one and two-mode) on the k-number frequencies
of nodes. Finally, we present a novel graphic representation of the structural cohesion of a
network, based on a three-dimensional scatter plot, using average node connectivity as a synthetic
and more informative measure of cohesion of each k-component.

For the first two analyses we do need to generate null models in order to discount the
possibility that the observed structure of actual networks is just the result of randomly mixing
papers and scientists or packages and developers. The null models used in this paper are based
on a bipartite configuration model (Newman, 2003), which consists of generating networks by
randomly assigning papers/programs to scientists/developers but maintaining constant the
distribution of papers per scientist and scientists per paper observed in the actual networks, that
is the bipartite degree distribution. For one-mode projections, we generated bipartite random
networks based on their original bipartite degree distribution, and then performed the one-mode
projection. This is a common technique for avoiding overestimating the local clustering of
one-mode projections (Uzzi et al., 2007). As the configuration model can generate some multiple
edges and self-loops, we followed the usual practice of deleting them before the analysis in
order to guarantee that random networks are simple, like actual networks.
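
A null model of this kind can be sketched with NetworkX's bipartite configuration model. The helper name and the tiny author/paper graph are hypothetical, and the sketch is not the code used for the analyses in this paper.

```python
import networkx as nx
from networkx.algorithms import bipartite


def bipartite_null_model(B, top_nodes, seed=None):
    """Random simple bipartite graph built from the degree sequences of B."""
    top = list(top_nodes)
    top_set = set(top)
    bottom = [n for n in B if n not in top_set]
    aseq = [B.degree(n) for n in top]       # e.g. papers per scientist
    bseq = [B.degree(n) for n in bottom]    # e.g. scientists per paper
    M = bipartite.configuration_model(aseq, bseq, seed=seed)  # a MultiGraph
    R = nx.Graph(M)                          # collapses multiple edges
    R.remove_edges_from(list(nx.selfloop_edges(R)))
    # In the generated graph the first len(aseq) integers are the top set.
    return R, set(range(len(aseq)))


# Hypothetical two-mode network: three authors and three papers.
B = nx.Graph([("ana", "p1"), ("ana", "p2"), ("bo", "p1"),
              ("bo", "p2"), ("cy", "p2"), ("cy", "p3")])
authors = ["ana", "bo", "cy"]

R, random_authors = bipartite_null_model(B, authors, seed=42)

# One-mode projection of the random network on the human side.
projection = bipartite.projected_graph(R, random_authors)
```

Because multiple edges are collapsed, the random network can have slightly fewer edges than the observed one; the degree sequences are preserved only up to that deletion.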

We start with the tree representation of the cohesive block structure. As proposed by
White et al. (2004), we can represent the k-component structure of a network by drawing a
tree whose nodes are k-components; two nodes are linked if the k-component of higher level
is nested inside the k-component of lower level (see Mani and Moody (2014, 1643, 1651) for
this kind of analysis on the Indian interorganizational ownership network). This representation
of the connectivity structure can be built during the run time of the exact algorithm. However,
because our heuristics are based on finding node independent paths, we have to compute first the
k-component hierarchy, and then construct the tree that represents the connectivity structure of
the network.

Figure 1: Cohesive blocks for two-mode and one-mode Nuclear Theory collaboration networks, and for their random counterparts. Random networks
were generated using a bipartite configuration model. We built 1000 random networks and chose one randomly; see text for details. For lower connectivity
levels we have removed some small k-components to improve readability: we do not show 1-components with fewer than 20 nodes, 2-components with
fewer than 15 nodes, or tricomponents with fewer than 10 nodes.

Figure 2: Cohesive blocks for two-mode and one-mode Debian collaboration networks, and for their random counterparts. Random networks were
generated using a bipartite configuration model. We built 1000 random networks and chose one randomly; see text for details. For lower connectivity
levels we have removed some small k-components to improve readability: we do not show 1-components with fewer than 20 nodes, 2-components with
fewer than 15 nodes, or tricomponents with fewer than 10 nodes.
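
The two-step procedure described above (first the k-component hierarchy, then the tree) can be illustrated with a short sketch. It assumes the hierarchy is already available as a dictionary mapping each level k to a list of node sets; that input format and the function name are hypothetical choices for this example, not the paper's data structures.

```python
def k_component_tree(hierarchy):
    """Return tree edges (parent, child) between nested k-components.

    hierarchy maps k -> list of node sets (the k-components at level k).
    Each component is identified by the pair (k, index in that list).
    """
    edges = []
    levels = sorted(hierarchy)
    for pos, k in enumerate(levels):
        for i, component in enumerate(hierarchy[k]):
            # Search the closest lower level that contains this component.
            for lower in reversed(levels[:pos]):
                parent = next(
                    (j for j, candidate in enumerate(hierarchy[lower])
                     if component <= candidate),
                    None,
                )
                if parent is not None:
                    edges.append(((lower, parent), (k, i)))
                    break
    return edges
```

For example, a hierarchy with one 1-component, two 2-components inside it, and a 3-component inside the first 2-component yields a tree with three edges, one per nesting relation.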





Figures 1a and 1c show the connectivity structure of the Nuclear Theory collaboration networks
represented as a tree, the former for the two-mode network and the latter for the one-mode one.
As we can see, both networks display non-trivial structure. The two-mode network has up
to an 8-component, but most nodes are in k-components with k < 6. Up to k = 3 most
nodes are in giant k-components, but for k = {4, 5} there are many k-components of similar
order. Figure 1c, which corresponds to the one-mode projection, has many more connectivity
levels, a byproduct of the mathematical transformation from two-mode to one-mode. In this
network, the maximum connectivity level is 46; the four long legs of the plot correspond to 4
cliques with 47, 31, 27 and 25 nodes. Notice that each of these 4 cliques is already a
separate k-component at k = 7. It is at this level of connectivity (k = {7, 8}) that the giant
k-components start to dissolve and many smaller k-components emerge.

In order to assess the significance of these results, we have to compare
the connectivity structure of actual networks with the connectivity structure of a random
network that maintains the observed bipartite degree distribution. In this case, we compare actual
networks with only one random network. We obtained it by generating 1000 random networks
and choosing one randomly. Figures 1b and 1d show the connectivity structure of the random
counterparts of the Nuclear Theory collaboration networks. For the two-mode network, instead
of the differentiated connectivity structure displayed by the actual bipartite network, there is a
flatter connectivity structure, where the highest-level k-component is a tricomponent. Moreover,
instead of many small k-components at high connectivity levels, the random bipartite network
has only giant k-components that contain all nodes with component number k. In this case, the
one-mode network is also quite different from its random counterpart. The random one-mode
network has only giant k-components up until k = 15, where the four cliques observed in the
actual network separate from each other to form distinct k-components.

The hierarchy of the connectivity structure displayed in these plots allows us to make
meaningful comparisons between networks in terms of their connectivity structure. For instance,
figures 2a and 2c show the connectivity structure of the Debian collaboration networks. The former
displays the bipartite connectivity structure, which is quite different from the two-mode Nuclear
Theory structure discussed above. Although there are some small k-components for each
connectivity level, most of the nodes with k-number k are in a giant k-component that encompasses
most of the nodes of that level. Even at the top level of connectivity (k = 5), 80 percent of the
88 nodes with k-number 5 are in the same 5-component. Figure 2c displays the cohesive block
structure for its one-mode projection. It consists of a monotonous linear succession of
increasingly smaller k-components nested inside each other.

Figures 2b and 2d show the connectivity structure of the random counterparts of the Debian
collaboration networks. The random one-mode projection has the same structure as its actual
counterpart, a single long chain of k-components nested inside each other. However, the random
two-mode structure is quite different from its actual counterpart: it consists of a chain of single
cohesive blocks. At lower connectivity levels, up to k = 3, the random network has more
nodes in those giant k-components than its actual counterpart; but the actual Debian two-mode
network has a bigger 4-component and also two 5-components that are not present in its random
counterpart. Thus, in terms of their connectivity structure, two-mode networks are farther apart
from their random counterparts than their one-mode projections.

Note that, so far, the comparison of actual networks with their random counterparts has
focused on a single random network. But a single random network is not a sound null model.
We need to generate a large enough set of them and perform the connectivity analysis to
have an accurate picture of possible connectivity structures generated solely by chance given
the observed bipartite degree distribution. A good way to evaluate the differences between the
actual network and the set of random networks is to compare the frequencies of the k-numbers of
their nodes. A node’s k-number, or component number, is the value k of the highest order
k-component in which it is embedded. In the barplots displayed in figure 3, each bar represents
the number of nodes that have k-number k. Green bars represent k-number frequencies for the
actual networks and blue bars represent the average value over 64 random networks that maintain
the degree distribution of the original two-mode network. We analyzed 64 random networks to
keep computation time reasonable, but we generated ten times as many random networks and
randomly selected one out of every ten to perform the actual analysis.
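
As an illustration of how the bars in such a plot can be obtained, the sketch below derives k-numbers and their frequencies from a k-component hierarchy and averages the frequencies over a set of random networks. The hierarchy format (a dictionary from k to a list of node sets) and the function names are assumptions made for this example.

```python
from collections import Counter


def k_numbers(hierarchy):
    """Map each node to the largest k of a k-component containing it."""
    numbers = {}
    for k in sorted(hierarchy):
        for component in hierarchy[k]:
            for node in component:
                numbers[node] = k  # higher levels overwrite lower ones
    return numbers


def k_number_frequencies(hierarchy):
    """Count how many nodes have each k-number."""
    return Counter(k_numbers(hierarchy).values())


def mean_frequencies(hierarchies):
    """Average k-number frequencies over several (random) networks."""
    total = Counter()
    for hierarchy in hierarchies:
        total.update(k_number_frequencies(hierarchy))
    return {k: count / len(hierarchies) for k, count in total.items()}
```

Calling `k_number_frequencies` on the actual network gives the heights of the green bars, and `mean_frequencies` on the hierarchies of the sampled random networks gives the heights of the blue bars.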

Figure 3 shows that two-mode and one-mode projections of the same network yield quite
different results in terms of k-number distribution among nodes when compared with their
random counterparts. Bipartite collaboration networks have slightly fewer nodes with low
component number (2 and sometimes 3) than their random counterparts. However, they have many
more nodes in higher levels of connectivity. This means that, in bipartite random networks,
the edges are more evenly distributed among all nodes. Thus more nodes are embedded in
bicomponents, and in some cases, tricomponents; but for this same reason, random networks
have far fewer nodes in k-components of higher order (4, 5 or 6) than actual networks.
Therefore, we can conclude that bipartite collaboration networks are significantly more hierarchical
in connectivity terms than their random counterparts. As this hierarchy cannot be explained in
terms of random mixing of papers/programs with scientists/developers, it must be the result of an
underlying organizational principle that shapes the structure of these collaboration networks.

Going one step beyond classical structural cohesion analysis, as proposed above, we can
deepen our analysis by also considering the average connectivity of the k-components of these
networks. By analogy with the k-component number of each node, which is the value k of the
deepest k-component in which that node is embedded, we can define the average
k-component number of each node as the average connectivity of the deepest
k-component in which that node is embedded. Notice that, unlike plain node connectivity,
average node connectivity is a continuous measure of cohesion. Thus it provides a more granular
measure of cohesion because we can rank k-components with the same k according to their
average node connectivity.
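
To make the distinction concrete, here is a small exact sketch of average node connectivity. It is not the heuristic proposed in this paper: it counts node independent paths with a textbook unit-capacity maximum flow on a node-split graph, which is only practical for small graphs. The adjacency-dictionary input and the function names are assumptions of this example.

```python
from collections import deque
from itertools import combinations


def local_node_connectivity(adj, s, t):
    """Number of node independent paths between distinct nodes s and t.

    adj maps node -> set of neighbours (undirected, simple graph). Every
    node v is split into (v, "in") -> (v, "out") with capacity 1, so two
    units of flow can never share an inner node.
    """
    cap = {}

    def add_arc(u, v, c):
        cap.setdefault(u, {})[v] = c
        cap.setdefault(v, {}).setdefault(u, 0)  # residual arc

    for v in adj:
        # The end nodes s and t may be used by every path.
        add_arc((v, "in"), (v, "out"), len(adj) if v in (s, t) else 1)
        for w in adj[v]:
            add_arc((v, "out"), (w, "in"), 1)

    source, sink = (s, "in"), (t, "out")
    flow = 0
    while True:
        # Breadth-first search for an augmenting path.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return flow
        v = sink
        while parent[v] is not None:  # push one unit along the path
            u = parent[v]
            cap[u][v] -= 1
            cap[v][u] += 1
            v = u
        flow += 1


def average_node_connectivity(adj):
    """Mean local node connectivity over all unordered pairs of nodes."""
    pairs = list(combinations(adj, 2))
    if not pairs:
        return 0.0
    total = sum(local_node_connectivity(adj, u, v) for u, v in pairs)
    return total / len(pairs)
```

A four-node cycle and a four-node clique with one edge removed both have node connectivity 2, but their average node connectivity differs (2 versus 13/6), which is the kind of ranking among k-components with the same k that the paragraph above describes.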

Figure 4 graphically represents the three networks with three-dimensional scatter plots^. In
these graphs, each dot corresponds to a node of the network; for two-mode networks, nodes
represent both scientists/developers and papers/programs. The Z axis (the vertical one) is
the average k-component number of each node, and the X and Y axes are the result of a
two-dimensional force-based layout algorithm implemented by the neato program of Graphviz
(Ellson et al., 2002). The two-dimensional layout is computed by constructing a virtual
physical model and then using an iterative solver procedure to obtain a low-energy configuration.
Following Kamada and Kawai (1989), an ideal spring is placed between each pair of nodes
(even if they are not connected in the network). The length of each spring corresponds to the
geodesic distance between the pair of nodes that it links. The final node positioning in the layout
approximates the path distance among pairs of nodes in the network.
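
The spring model behind this layout can be made explicit with a short sketch of the energy that the solver tries to minimize. This is not the neato implementation; it follows the usual Kamada-Kawai formulation, with the unit spring length and the spring-strength constant both set to 1 (an assumption of this example), and it assumes a connected graph.

```python
from collections import deque
from math import hypot


def geodesic_distances(adj, source):
    """Breadth-first search distances from source in an unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def spring_energy(adj, pos):
    """Kamada-Kawai style energy of a layout pos (node -> (x, y)).

    Each pair of nodes is joined by a spring whose ideal length is their
    geodesic distance d and whose strength decays as 1 / d**2.
    """
    nodes = list(adj)
    energy = 0.0
    for i, u in enumerate(nodes):
        dist = geodesic_distances(adj, u)
        for v in nodes[i + 1:]:
            d = dist[v]
            gap = hypot(pos[u][0] - pos[v][0], pos[u][1] - pos[v][1])
            energy += 0.5 * (gap - d) ** 2 / d ** 2
    return energy
```

A layout in which every Euclidean distance equals the corresponding geodesic distance has zero energy; any distortion raises it, which is why nodes that are close in path distance end up close on the X-Y plane.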

This novel graphic representation of cohesion structure is inspired by the approximation
technique developed by Moody (2004) for plotting the approximate cohesion contour of large
networks to which it is not practical to apply Moody and White’s (2003) exact algorithm for
k-components. Moody’s technique is based on the fact that force-based layout algorithms tend to draw
nodes within highly cohesive subgroups near each other. The surface of the two-dimensional
plane is then divided into squares of equal area, and node independent paths are computed on a
sample of pairs of nodes inside each square so as to obtain an approximation of the node
connectivity in that square. Then we can draw a surface plot using a smoothing probability density
function. However, in order to obtain a nice smooth surface plot, we have to use heavy
smoothing in the probability density function, and carefully choose the area of the squares (mostly by
trial and error). Moreover, this technique strongly relies on the force-based layout algorithm to
put nodes in highly cohesive subgroups near each other, something which is not guaranteed
because these algorithms are usually based on path distance and not directly on node connectivity. Because
we are able to compute the k-component structure with our heuristics for large networks, the
three-dimensional scatter plot only relies on the layout algorithm for setting the X and Y
positions of the nodes, while the Z position (average node connectivity) is computed directly from
the network. Moreover, we don’t have to use a smoothed surface plot because we have a value
of average connectivity for each node, and thus we can plot each node as a dot on the plot. This
gives a more accurate picture of the actual cohesive structure of a network.

^These plots are produced with the powerful Matplotlib python library (Hunter, 2007).

(a) Bipartite network formed by developers and packages during 2 years of collaboration (from 2007
to 2009) on the release codenamed Lenny of the Debian operating system

(b) Unipartite network formed by developers during 2 years of collaboration (from 2007 to 2009) on
the release codenamed Lenny of the Debian operating system

(c) Bipartite network formed by scientists and preprints during 5 years (2006-2010) in the high
energy physics (theory) section of arXiv.org

(d) Unipartite network formed by scientists during 5 years (2006-2010) in the high energy physics
(theory) section of arXiv.org

(e) Bipartite network formed by scientists and preprints during 5 years (2006-2010) in the nuclear
physics (theory) section of arXiv.org

(f) Unipartite network formed by scientists during 5 years (2006-2010) in the nuclear theory
section of arXiv.org

Figure 3: Barplots of k-number frequencies for two-mode and one-mode collaboration networks
and their random counterparts. Green bars represent the actual k-number frequencies and blue
bars represent the average k-number frequencies for 64 random networks that maintain the
degree distribution of the original two-mode network.


Our synthetic representation of their cohesive structures can help researchers visualize the 
presence of different organizational mechanisms in different kinds of collaboration networks. 
The difference between the Debian and the scientific collaboration networks is striking. In 
figure 4a we can see the scatter plot for a Debian bipartite network. We can observe a clear 
vertical separation among nodes in different connectivity levels. This is because almost all 
nodes in each connectivity level are in a giant k-component and thus they have the same average
connectivity. In other words, developers in Debian show different levels of engagement and 
contribution, with a core group of developers deeply nested at the core of the community. This 
pattern is the result of formal and informal rules of collaboration that evolved over the years 
(O’Mahony and Ferraro, 2007) into a homogeneous hierarchical structure, where there is only 
one core of highly productive individuals at the center. Not surprisingly, perhaps, the Debian 
project has been particularly resilient to developers’ turnover and splintering factions. 

Scientific collaboration networks show a rather different structure of collaboration. The 
two-mode science collaboration networks (figures 4c and 4e) display a continuous hierarchical 
structure in which there are nodes at different levels of average connectivity for each discrete 
plain connectivity level. This is because science collaboration networks have a complex
cohesive block structure where there are a lot of independent k-components in each plain
connectivity level, for k > 3. Each small cohesive block has a different order, size and average
connectivity; thus, when we display them in this three-dimensional scatter plot we observe 
a continuous hierarchical structure that contrasts with the almost discrete structure of Debian 
collaboration networks. 

One explanation for this heterogeneous connectivity structure is that scientific
collaborations cluster around a variety of different aims, methods, projects, and institutional
environments. Therefore, as the most productive scientists collaborate with each other,
hierarchies naturally emerge. However, we are less likely to observe one single hierarchical order as
we did in the Debian network, as more than one core of highly productive scientists is likely 
to emerge. In a way our visualization captures the structure of the “invisible college” of the 
scientific discipline. 

If we compare the bipartite networks with their one-mode projections using this graphical 
representation (see figures 4b, 4d, and ??) we can see that, again, they look quite different. 
While bipartite average connectivity structure for the Debian network is characterized by clearly 
defined and almost discrete hierarchical levels, its one-mode counterpart shows a continuous 
hierarchical structure. However, this is not caused by the presence of many small k-components
at the same level k, as in the case of bipartite science networks discussed above, but by the close 
succession of hierarchy levels with almost the same number of nodes in a chain-like structure 
(as depicted in figure 2c). 

For science collaboration networks, the three-dimensional scatter plots of one-mode
projections are also quite different from those of their original bipartite networks. They have many more
hierarchy levels than bipartite networks but most nodes are at lower connectivity levels. Only a
few nodes are at top levels of connectivity, and they all form part of some clique; these are the
groups in the long “legs” of the cohesive block structure depicted in figure 1c. Thus, the
complex hierarchical connectivity structure of bipartite collaboration networks gets blurred when
we perform the one-mode projection. An important consequence of the projection is that only a
few nodes embedded in big cliques appear at top connectivity levels and all other nodes are far
down in the connectivity structure. This carries the risk of overestimating the importance of
those nodes in big cliques and of underestimating the importance of nodes that, despite being at
high levels of the bipartite connectivity structure, appear only at lower levels of the unipartite
connectivity structure.

(e) High Energy Theory 2 mode

Figure 4: Average connectivity three-dimensional scatter plots. X and Y are the positions
determined by the Kamada-Kawai layout algorithm. The vertical dimension is average connectivity.
Each dot is a node of the network, and two-mode networks contain both papers/programs and
scientists/developers.


6 Conclusions 

This article contributes to our understanding of structural cohesion in a number of ways. 

First, we extended theoretically the structural cohesion model by considering not only plain
node connectivity, which is the minimum number of nodes that must be removed in order to
disconnect a network, but also the average node connectivity of networks and their cohesive groups,
which is the number of nodes that, on average, must be removed to disconnect an arbitrary pair
of nodes in the network. Taking into account average connectivity allows a more granular
conception of structural cohesion, and we show in our empirical analysis of collaboration networks
how this approach leads to useful implications in empirical research.

Second, we developed heuristics to compute the k-component structure, along with the
average node connectivity for each k-component, based on the fast approximation to compute
node independent paths (White and Newman, 2001). These heuristics allow us to compute
the approximate value of group cohesion for moderately large networks, along with the whole
hierarchical structure of connectivity levels, in a reasonable time frame. We showed that
these heuristics can be applied to networks at least one order of magnitude bigger than the ones
manageable by the exact algorithm proposed by Moody and White (2003). To ensure
reproducibility and facilitate diffusion of these heuristics we provided a very detailed description of
the implementation, along with an illustration of the source code^.

Finally, we used the heuristics proposed here to analyze three large collaboration networks.
With this analysis, we showed that the heuristics and the novel visualization technique for
cohesive network structure help us capture important differences in the way collaboration is
structured. Obviously a detailed analysis of the institutional and organizational structures in which
the collaborative activity took place is well beyond the scope and aims of this paper. But future
research could leverage the tools we provide to systematically measure those structures. For
instance, sociologists of science often compare scientific disciplines in terms of their
collaborative structures (Moody, 2004) and their level of controversies (Shwed and Bearman, 2010). The
measures and the visualization technique we proposed could capture these features and
compare them across scientific disciplines. This would make it possible to further our
understanding of the social structure of science, and its effects on productivity, novelty and
impact. Social network researchers interested in organizational robustness would also benefit
from leveraging the structural cohesion measures to detect sub-groups that are more critical to
the organization’s resilience, and thus prevent factionalization. Exploring the consequences of
different forms of cohesive structures will eventually help us further our theoretical
understanding of collaboration and the role that cohesive groups play in linking micro-level dynamics with
macro-level social structures.


^We believe that providing a detailed implementation is critical to ensure reproducibility, but often these details
are black-boxed, sometimes because of proprietary software restrictions or authors’ reluctance to share their work.

Appendix A Illustration of the heuristics

In order to illustrate how the proposed heuristics work, we will use a convenient synthetic
network with 99 nodes and 200 edges where k < 5. This network is based on a two dimensional
grid of 5 by 5 nodes. In each corner of the grid we attach a Petersen graph (P), linked by two
edges to the grid. Thus the only four nodes of the grid with degree 2 are linked to a Petersen
graph. All nodes of the grid are therefore part of a 3-core. Each P is linked to two complete
graphs with 5 nodes (K_5); in two cases those two K_5 overlap in only one node and in the other
two cases, they overlap in two nodes. The Petersen graph is linked by three edges to one of the
K_5, thus making one K_5 of each pair part of a tricomponent along with P. In the case of the two K_5
that overlap in only one node, the outer K_5 also has one edge linking one of its nodes with
one node of P, in order to make the whole graph biconnected (see figure 5). Petersen
graphs have node connectivity 3 and complete graphs with 5 nodes have node connectivity 4.
Notice that the whole example graph is biconnected and a 3-core, but it has three levels of node
connectivity: 2 for the grid, 3 for the Petersen graphs (P) and 4 for the complete graphs of 5
nodes (K_5).



(a) Nodes colored by component number according to our algorithm. Note the error when two K_5
overlap in two nodes.

(b) Nodes colored by component number according to the Moody & White algorithm.

Figure 5: Synthetic graph composed of a two dimensional grid of 25 nodes, four Petersen graphs
(P) with ten nodes each (with k = 3) linked by two edges to the grid, and eight complete graphs
K_5 (with k = 4) linked by three edges to each Petersen graph. In two cases the K_5 overlap in 1
node and in the other two cases they overlap in 2 nodes. The whole graph is biconnected and
also a tricore. Notice that our algorithm fails to classify the K_5 that overlap in two nodes
as 4-components. See text and figure 7 for details.

As discussed above, a k-core is a maximal subgraph in which every node has degree k or more.
The core number of a node is the largest value k of a k-core containing that node. On the other
hand, a k-component is a maximal subgraph that cannot be disconnected by removing fewer than
k nodes. The component number of a node is the largest value k of a k-component containing
that node.
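
The core number can be computed by repeatedly removing a node of lowest remaining degree. The following minimal sketch uses that standard peeling procedure on an adjacency dictionary; the function name and input format are choices made for this example, and the quadratic node selection is kept for brevity rather than speed.

```python
def core_numbers(adj):
    """Return node -> core number for an undirected simple graph.

    adj maps node -> set of neighbours.
    """
    degree = {v: len(neighbours) for v, neighbours in adj.items()}
    remaining = set(adj)
    core = {}
    k = 0
    while remaining:
        v = min(remaining, key=degree.get)  # lowest remaining degree
        k = max(k, degree[v])
        core[v] = k
        remaining.remove(v)
        for w in adj[v]:
            if w in remaining:
                degree[w] -= 1
    return core
```

Two triangles that share a single node illustrate the difference between the two definitions: every node has core number 2, yet removing the shared node disconnects the graph, so the graph as a whole is not a 2-component. The core number is therefore only an upper bound on the component number, which is why the heuristics use it to select candidate nodes at each level.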

The graph of figure 5 is a biconnected 3-core, which means that it is a graph with minimum
degree 3 that cannot be disconnected by removing fewer than 2 nodes. Our algorithm starts
by considering the whole graph in step 2, but in k-core subgraphs with more than one
bicomponent, the following steps are performed for each bicomponent of the k-core. We will only
compute up until k = 4 because the largest core number of a node in G is 4.

For k = 3 we create an auxiliary graph with all biconnected nodes with core number ≥ 3
(see figure 6). In this case all nodes have a core number greater than or equal to 3. Thus the
auxiliary graph H_3 for k = 3 contains all 99 nodes. We then link two nodes in H_3 if we can find
k or more node independent paths between them. As we can see, the result is five connected
components, four of which correspond to each Petersen graph plus the two K_5, while the last
one corresponds to the nodes that form the grid. The latter has 4 nodes that are linked by 3 node
independent paths to only one node; these four nodes are the four corner nodes of the grid.

(a) Auxiliary graph H_3 for k = 3 computed using White & Newman’s approximation algorithm for
local node connectivity.

(b) Auxiliary graph H_3 for k = 3 computed using the flow-based connectivity algorithm for local
node connectivity.

(c) All subgraphs H_candidate from H_3 computed using White & Newman’s approximation algorithm
for local node connectivity.

(d) All subgraphs H_candidate from H_3 computed using the flow-based connectivity algorithm for
local node connectivity.

(e) Detected tricomponents using the heuristics with the relaxation criterion of density > 0.95
in H_candidate.

Figure 6: Auxiliary graph H_3, for k = 3. Note that when using White and Newman’s
approximation algorithm for local node connectivity (subfigure a), some node independent paths are
not detected: the P subgraphs linked to the two K_5 that overlap in two nodes should have core
number 14 (blue) as in subfigure b, but they have core number 12. Thus, to correctly detect all
tricomponents we have to set a relaxation criterion for H_candidate; in this example, setting density
at 0.95 or allowing a variation of 2 in the degree of all nodes of H_candidate allows the algorithm
to correctly detect all tricomponents.

(a) Auxiliary graph H_4 for k = 4 computed using White and Newman’s approximation.

(b) Detected 4-components using our heuristics. Note that there should be four more K_5; the ones
that overlap in two nodes are not detected as 4-components. See text for an explanation.

Figure 7: Auxiliary graph H_4 for k = 4. In this case both White and Newman’s approximation
algorithm and the exact flow-based algorithm for local node connectivity yield equal results.
Note that there should be four more K_5 in subfigure b; the ones that overlap in two nodes are
not detected as 4-components. This is because, as can be seen in subfigure a, the nodes in these
H_candidate subgraphs all have the same core number, but their density is 0.67 and the difference
in degree is 3. Thus, in order to detect them we would have to relax the clique criterion for
H_candidate too much, and even then we would classify both K_5 as a single 4-component, which
is obviously wrong.
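
The construction of the auxiliary graph and of its connected components, as described above for k = 3, can be sketched as follows. The path-counting routine is passed in as a function because it may be either the White and Newman approximation or an exact flow-based algorithm; the name `independent_paths` and the adjacency-dictionary output are hypothetical choices for this illustration, not the paper's implementation.

```python
from itertools import combinations


def auxiliary_graph(nodes, independent_paths, k):
    """Link two nodes when at least k node independent paths join them.

    independent_paths(u, v) is supplied by the caller and returns an exact
    or approximate count of node independent paths between u and v.
    """
    H = {v: set() for v in nodes}
    for u, v in combinations(nodes, 2):
        if independent_paths(u, v) >= k:
            H[u].add(v)
            H[v].add(u)
    return H


def connected_components(H):
    """Return the connected components of H as a list of node sets."""
    seen, components = set(), []
    for start in H:
        if start in seen:
            continue
        stack, component = [start], {start}
        while stack:
            u = stack.pop()
            for v in H[u] - component:
                component.add(v)
                stack.append(v)
        seen |= component
        components.append(component)
    return components
```

Because an approximate path counter can only miss paths, an auxiliary graph built with it may lack some edges that the exact version has, which is the effect visible when comparing subfigures 6a and 6b.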

Notice that when using White and Newman’s approximation algorithm for local node
connectivity (subfigure 6a), some node independent paths that actually exist are not detected: the
P subgraphs linked to the two K_5 that overlap in two nodes should have a core number of 14
(blue) because there are 3 node independent paths linking each pair of different nodes in the
subgraph formed by the P and the K_5 to which it is linked through three edges, as in
subfigure 6b, which was computed using the exact flow-based algorithm for local node connectivity.
Notice also that the grid has core number 14 in 6a but actually should have core number 20 as
shown in 6b. This illustrates the importance of computing biconnected components of H (step
3.c) before building the subgraphs H_candidate (step 3.d).

Figures 6c and 6d depict the H_candidate subgraphs, the former using White and Newman’s
approximation algorithm and the latter using an exact flow-based algorithm for local node
connectivity. The subgraphs H_candidate are composed of nodes that are in the same biconnected
component of H and have exactly the same core number. Notice that in figure 6c the P graphs
linked to the two K_5 that overlap in two nodes have core number < n - 1 (the magenta clusters);
thus they are not complete (density = 0.96) and the degree of their nodes is not homogeneous: two
nodes have degree 12, four have degree 13, and nine have degree 14. Therefore, if we enforced
the clique criterion for H_candidate we would not detect all tricomponents because, following the
algorithm, we would have to start removing nodes with the lowest degree and check if at some
point we find a complete subgraph. In order to correctly detect all tricomponents in this
illustrative example, we first have to establish a relaxation of the clique criterion for H_candidate. In this
case, setting density at 0.95 or allowing a variation of 2 in the degree of all nodes of H_candidate
allows the algorithm to correctly detect all tricomponents, as shown in figure 6e.

For k = 4, the auxiliary graph H_4 is composed of 4 connected components, which
correspond to the pairs of K_5 that share one node and the pairs of K_5 that share 2 nodes (see figure
7a). In terms of biconnectivity, there are six bicomponents, with the two K_5 that overlap in two
nodes as a single bicomponent. Inside these six bicomponents there are eight 4-components, but
only four of them were detected (see figure 7b). This is because, when we build the H_candidate
subgraphs with all nodes in each biconnected component of H_4 that have exactly the same core
number, in the case of the two K_5 that overlap in two nodes, all their nodes have the same core
number (4), but their density is 0.67 and the difference in degree is 3. Thus, in order to detect
them we would have to relax the clique criterion for H_candidate too much, and even then, we
would classify both K_5 overlapping in two nodes as a single 4-component, which is obviously
wrong because they have node connectivity 2.
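
The density and degree figures quoted in the two cases above can be checked with a few lines of arithmetic. The first degree list is taken directly from the text (two nodes of degree 12, four of degree 13, nine of degree 14). The second is an assumption of this example: it supposes that, for two K_5 sharing two nodes, H_candidate has the same edges as the union of the two cliques, which gives the two shared nodes degree 7 and the other six nodes degree 4.

```python
def density(degrees):
    """Density of a simple graph given the list of its node degrees."""
    n = len(degrees)
    edges = sum(degrees) / 2
    return edges / (n * (n - 1) / 2)


def degree_spread(degrees):
    """Difference between the largest and the smallest degree."""
    return max(degrees) - min(degrees)


# H_candidate for a Petersen graph plus its K_5, approximate path counts.
tricomponent_candidate = [12] * 2 + [13] * 4 + [14] * 9

# Assumed H_candidate for two K_5 that overlap in two nodes.
overlapping_k5 = [7] * 2 + [4] * 6

print(round(density(tricomponent_candidate), 4), degree_spread(tricomponent_candidate))
print(round(density(overlapping_k5), 4), degree_spread(overlapping_k5))
```

The first case gives 101 edges out of 105 possible (about 0.96) with a degree spread of 2, so either relaxation mentioned in the text accepts it. Under the stated assumption, the second case gives 19 edges out of 28 possible (about 0.68, in line with the roughly 0.67 reported) with a degree spread of 3, far below any reasonable threshold.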

Note that this kind of false negative only happens when two k-components of the same level
of connectivity and the same order overlap. If instead of two K_5 they were k-components with
different order but the same connectivity, our algorithm would be able to separate them because
they would have a different core number and thus they would be part of a different H_candidate
subgraph.


Appendix B Performance analysis 

The heuristics presented here are implemented on top of NetworkX (Hagberg et al., 2008), a
library for the analysis of complex networks, using the Python programming language (Van Rossum,
1995). We have chosen Python because it is a language with high readability and flexibility that
allows you to easily apply the well-known principle of writing software for people to read and,
only incidentally, for machines to execute (Abelson et al., 1985). To ensure reproducibility and
accessibility we have used only free software to build and run all analyses presented in this
paper.

The implementation of the heuristics presented here is not trivial; a careful implementation 
is needed to ensure that it has a reasonable memory footprint and that it runs in a reasonable 
time. Appendix C contains a detailed discussion of the implementation details and appendix D 
contains the python code of a simplified implementation for illustrative purposes. 

Figure 8 presents the performance of the heuristics (green) compared with two variants of
the exact algorithm: the Moody & White algorithm based on k-cutsets (red) and our algorithm
using exact flow-based node connectivity for building the auxiliary graph. The tests were
performed, on the one hand, on random graphs of several different orders with fixed average degree
(Erdős-Rényi model) or fixed power law exponent (power law model) and, on the other
hand, on graphs with a fixed number of nodes (1000 for the heuristics and 100 for the exact
algorithm) where we increase the number of edges. Random networks built using the Erdős-Rényi model
have a flat hierarchical structure because edges are evenly distributed across all nodes of the
network. The Erdős-Rényi graphs used in this benchmark have a big tricomponent and no higher
connectivity levels. Random networks built using a power law based degree distribution have a
steep hierarchical structure; the networks used in the benchmark have hierarchy levels of up to
20. Both the heuristics and the exact algorithms perform better in sparse networks with a steep
hierarchical structure.
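
A benchmark of the first kind (growing order, fixed average degree) can be set up along the following lines. This is a sketch under stated assumptions rather than the benchmark code used here: it draws a fixed number of edges uniformly at random, which is one variant of the Erdős-Rényi model, and the default average degree of 4 and the function names are hypothetical.

```python
import random
import time


def erdos_renyi_fixed_degree(n, average_degree, seed=None):
    """Random simple graph with n nodes and n * average_degree / 2 edges."""
    target = int(n * average_degree / 2)
    if target > n * (n - 1) // 2:
        raise ValueError("average degree too high for this many nodes")
    rng = random.Random(seed)
    edges = set()
    while len(edges) < target:
        u, v = rng.sample(range(n), 2)  # two distinct nodes, no self-loops
        edges.add((min(u, v), max(u, v)))  # the set drops multiple edges
    return edges


def time_on_growing_graphs(algorithm, orders, average_degree=4, seed=0):
    """Time algorithm(n, edges) on random graphs of increasing order."""
    timings = {}
    for n in orders:
        edges = erdos_renyi_fixed_degree(n, average_degree, seed=seed)
        start = time.perf_counter()
        algorithm(n, edges)
        timings[n] = time.perf_counter() - start
    return timings
```

Plotting the resulting timings against the order on logarithmic axes gives curves comparable in kind to those in figure 8a; holding the average degree fixed keeps density effects from being confounded with the effect of network order.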

As we can see in figure 8, the heuristics run in polynomial time. They are fast enough to be
practically applicable to networks with a few tens of thousands of nodes and edges. This is
one order of magnitude better than the exact algorithm proposed by Moody and White (2003),
and also an order of magnitude faster than using flow-based algorithms for building the
auxiliary graph. Notice that the k-cutset based algorithm proposed by Moody & White (or at least
our implementation) is faster than the exact flow-based local node connectivity variant of our
algorithm.

(a) Performance of connectivity algorithms when adding nodes maintaining constant the average
degree (Erdős-Rényi) or the exponent of the power law governing the degree distribution (a = 2).
Logarithmic scale.

(b) Performance of the heuristics when adding edges and maintaining nodes constant (1000 nodes).
Inset: performance of the exact algorithm with one order of magnitude fewer nodes (100 nodes).
Both in logarithmic scale.

Figure 8: Log-log plots comparing the heuristics and the exact algorithm to compute the
k-component structure. In this comparison, the heuristics do not compute the average node
connectivity, only plain node connectivity, which is what is calculated by the exact algorithm.
We have also implemented the exact algorithm in order to be able to compare both algorithms
using the same language and infrastructure. All figures presented here were obtained running
PyPy (Bolz et al., 2009). Using the heuristics proposed in this paper, we are able to handle
networks almost one order of magnitude bigger than with the exact algorithm.

The implementation that we provide in this paper only computes the exact solution for
biconnected components. The heuristics presented here start from biconnectivity, but they could
be improved by using a triconnectivity algorithm. The result would be: a) faster, because there
is a linear algorithm to compute triconnected components (Hopcroft and Tarjan, 1974; Gutwenger
and Mutzel, 2001); and b) more accurate, because we would compute the exact solution up to
k = 3. However, as far as we know, there is no publicly available implementation of triconnected
components. An optimal implementation of the heuristics presented here would have to incorporate
the triconnectivity algorithm to improve its accuracy and to allow it to run in reasonable time
on somewhat larger networks.

Appendix C Implementation details

The heuristics proposed here were implemented by the first listed author on top of NetworkX
(Hagberg et al., 2008), a Python package for the study of the structure and dynamics of complex
networks. Other parts of the powerful Python (Van Rossum, 1995) scientific computing stack
(Jones et al., 2001; Perez and Granger, 2007; Hunter, 2007) were also essential. The main
requirement was that the whole software stack must be free software, in order to avoid the black
box effect of software solutions that do not release their source code. We believe that this is
a necessary condition for ensuring the reproducibility of scientific research. Appendix D
contains Python code for the main part of the algorithm.

The implementation of the heuristics is not trivial. A few issues need to be addressed in order
to obtain a performance, in terms of both computation time and memory consumption, that allows
these heuristics to be applied to large networks. The authors are indebted to Aric Hagberg and
Dan Schult (developers of the NetworkX package) for their help with this implementation.

The second step of the heuristics (compute the biconnected components of the input graph
and use them as a baseline for k-components with k > 2) is faster than using the logic of the
heuristics for k = 2. The computation of biconnected components runs in linear time with
respect to the number of nodes and edges (Tarjan, 1972). Moreover, in large networks
bicomponents usually contain a large share of the nodes. Thus, if we used the approximation
logic to compute them, the memory footprint for large networks would be too large to be
practical. The implementation provided with this paper only computes the exact solution for
bicomponents, but there is also a linear algorithm to compute triconnected components (Hopcroft
and Tarjan, 1974; Gutwenger and Mutzel, 2001). The heuristics would be even faster if we applied
to tricomponents the approach used for bicomponents. However, implementing triconnectivity is
quite challenging and, to our knowledge, there is no implementation of triconnected components
in free network analysis software packages.
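
To make the biconnected baseline concrete, the following is a minimal sketch of a depth-first
search for biconnected components, in the spirit of the linear-time approach cited above. It is
an illustration only, not the NetworkX routine used in our implementation: the function name and
the adjacency-dictionary input are assumptions of this example, and the recursive formulation
would hit Python's recursion limit on large graphs.

```python
def biconnected_components(adj):
    """Return the node sets of the biconnected components of a simple graph.

    adj maps each node to an iterable of its neighbours (undirected graph).
    """
    disc = {}        # discovery order of each visited node
    low = {}         # lowest discovery index reachable from the DFS subtree
    edge_stack = []  # edges seen since the last component was closed
    components = []

    def dfs(u, parent):
        disc[u] = low[u] = len(disc)
        for v in adj[u]:
            if v == parent:
                continue
            if v not in disc:                 # tree edge
                edge_stack.append((u, v))
                dfs(v, u)
                low[u] = min(low[u], low[v])
                if low[v] >= disc[u]:         # u separates the subtree of v
                    component = set()
                    while True:
                        edge = edge_stack.pop()
                        component.update(edge)
                        if edge == (u, v):
                            break
                    components.append(component)
            elif disc[v] < disc[u]:           # back edge to an ancestor
                edge_stack.append((u, v))
                low[u] = min(low[u], disc[v])

    for start in adj:
        if start not in disc:
            dfs(start, None)
    return components


# Two triangles that share node 2: node 2 is a cut node.
bowtie = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3, 4], 3: [2, 4], 4: [2, 3]}
print(sorted(sorted(c) for c in biconnected_components(bowtie)))
```

In the example, the shared node belongs to both components, which is the kind of overlap that
k-components allow in general (up to k - 1 shared nodes).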

The auxiliary graph H is usually very dense in real world networks because a large part of
the nodes that are in a biconnected part of a k-core are actually part of a k-component. The
memory footprint of this dense auxiliary graph prevents a naive implementation of the heuristics
from being practical for large networks. Our solution to this problem is to use a complement
graph data structure that only stores the edges that are not present in the actual auxiliary
graph. When algorithms are applied to this complement graph data structure, it behaves as if it
were the dense version. This is the only way to obtain a memory footprint that allows the
heuristics presented in this paper to be applied to large networks.
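
The idea of the complement graph data structure can be sketched in a few lines of plain
Python. The class below is a simplified illustration under our own naming (`ComplementGraph`,
`add_missing_edge` and `stored_entries` are hypothetical names for this example); it is not the
AntiGraph class used in the actual implementation, which has to expose the full NetworkX graph
interface.

```python
class ComplementGraph:
    """Dense undirected graph stored through its missing edges."""

    def __init__(self, nodes):
        # For each node, keep only the nodes it is NOT connected to.
        self._missing = {node: set() for node in nodes}

    def add_missing_edge(self, u, v):
        """Record that u and v are not adjacent in the dense graph."""
        self._missing[u].add(v)
        self._missing[v].add(u)

    def neighbors(self, node):
        """Iterate over neighbours as if the dense graph were stored."""
        missing = self._missing[node]
        return (other for other in self._missing
                if other != node and other not in missing)

    def degree(self, node):
        """Degree of node in the dense graph."""
        return len(self._missing) - 1 - len(self._missing[node])

    def stored_entries(self):
        """Number of adjacency entries actually held in memory."""
        return sum(len(missing) for missing in self._missing.values())


# A complete graph on five nodes with a single edge (0, 1) removed.
H = ComplementGraph(range(5))
H.add_missing_edge(0, 1)
print(sorted(H.neighbors(0)), H.degree(0), H.stored_entries())
```

In this toy case the structure holds 2 adjacency entries, whereas an explicit adjacency
representation of the same graph would hold 18. The saving grows with the density of the
auxiliary graph.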
Appendix D Python code

This is a simplified implementation of the heuristics for illustrative purposes. A fully functional
version of the NetworkX package with all the code necessary to run the heuristics is available from
the authors upon request.


# Standard Python libraries
import collections
from itertools import combinations

# NetworkX library for network analysis
import networkx as nx
from networkx import biconnected_components, core_number, density, k_core

# White and Newman node connectivity approximation
# Code in https://networkx.lanl.gov/trac/ticket/538
from connectivity_approx import vertex_connectivity_approx

# AntiGraph data structure
# Code in https://networkx.lanl.gov/trac/ticket/608
from antigraph import AntiGraph


def _same(measure):
    # Helper not shown in the original listing; assumed here to test
    # whether all the values of the dictionary are equal.
    return len(set(measure.values())) <= 1


def k_components(G, average=True, exact=False, min_density=0.95):

    def _update_results(k, avg_k, components):
        # Auxiliary function to update results data structures
        # Code not shown
        pass

    if exact:  # Use flow based exact algorithm
        node_connectivity = nx.local_node_connectivity
    else:  # Use White and Newman (2001) approximation algorithm
        # Called below as node_connectivity(SG, u, v)
        node_connectivity = vertex_connectivity_approx

    ## Data structures to return results
    # Dictionary with connectivity level (k) as keys and a list of
    # sets of nodes that form a k-component as values
    k_components = collections.defaultdict(list)
    # Dictionary with nodes as keys and maximum k of the deepest
    # k-component in which they are embedded
    k_number = dict((n, (0, 0)) for n in G)
    # dict to store node independent paths
    nip = {}

    # Exact solution for k = 1
    components = nx.connected_components(G)
    _update_results(1, 1, components)
    # Bicomponents as a base to check for higher order k-components
    bicomponents = nx.biconnected_components(G)
    _update_results(2, 2, bicomponents)
    # There is no k-component of k > maximum core number
    # \kappa(G) <= \lambda(G) <= \delta(G)
    g_cnum = core_number(G)
    max_core = max(g_cnum.values())
    for k in range(3, max_core + 1):
        C = k_core(G, k, core_number=g_cnum)
        for nodes in biconnected_components(C):
            # Build a subgraph SG induced by the nodes that are part of
            # each biconnected component of the k-core subgraph C.
            if len(nodes) < k:
                continue
            SG = G.subgraph(nodes)
            # Build auxiliary graph
            H = AntiGraph()
            H.add_nodes_from(SG.nodes())
            for u, v in combinations(SG, 2):
                K = node_connectivity(SG, u, v)
                nip[(u, v)] = K
                if k > K:
                    H.add_edge(u, v)
            for h_nodes in biconnected_components(H):
                if len(h_nodes) <= k:
                    continue
                HS = H.subgraph(h_nodes)
                h_cnum = core_number(HS)
                first = True
                for c_value in sorted(set(h_cnum.values()), reverse=True):
                    cands = set(n for n, cnum in h_cnum.items()
                                if cnum == c_value)
                    # Skip checking for overlap for the highest core value
                    if first:
                        overlap = False
                        first = False
                    else:
                        overlap = set.intersection(*[
                            set(x for x in HS[n] if x not in cands)
                            for n in cands])
                    if overlap and len(overlap) < k:
                        Hc = HS.subgraph(cands | overlap)
                    else:
                        Hc = HS.subgraph(cands)
                    if len(Hc) <= k:
                        continue
                    hc_core = core_number(Hc)
                    if _same(hc_core) and density(Hc) == 1.0:
                        Gc = k_core(SG.subgraph(Hc), k)
                    else:
                        # Peel minimum degree nodes until the candidate is
                        # dense enough or nothing is left
                        while Hc:
                            Gc = k_core(SG.subgraph(Hc), k)
                            Hc = HS.subgraph(Gc)
                            if not Hc:
                                break
                            hc_core = core_number(Hc)
                            if _same(hc_core) and density(Hc) >= min_density:
                                break
                            hc_deg = dict(Hc.degree())
                            min_deg = min(hc_deg.values())
                            remove = [n for n, d in hc_deg.items()
                                      if d == min_deg]
                            Hc.remove_nodes_from(remove)
                    if not Hc or len(Gc) <= k:
                        continue
                    for k_component in biconnected_components(Gc):
                        if len(k_component) <= k:
                            continue
                        Gk = k_core(SG.subgraph(k_component), k)
                        # The k-core step can shrink the candidate; skip it
                        # if it became too small (also avoids dividing by 0)
                        if len(Gk) <= k:
                            continue
                        # Average node connectivity of the k-component
                        num = 0.0
                        den = 0.0
                        for u, v in combinations(Gk, 2):
                            den += 1
                            num += (nip[(u, v)] if (u, v) in nip
                                    else nip[(v, u)])
                        # Arguments follow the order (k, avg_k, components)
                        _update_results(k, num / den, [set(Gk)])
    return k_components, k_number
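
The last loop of the listing computes the average node connectivity of a component as the mean
of the pairwise number of node independent paths over all pairs of its nodes. The standalone
sketch below isolates that step; the function name and the toy `nip` dictionary are assumptions
made for this illustration.

```python
from itertools import combinations


def average_node_connectivity(nodes, nip):
    """Mean pairwise node connectivity over all pairs of nodes.

    nip maps an unordered pair, stored either as (u, v) or as (v, u), to
    the number of node independent paths between u and v.
    """
    total = 0
    pairs = 0
    for u, v in combinations(nodes, 2):
        total += nip[(u, v)] if (u, v) in nip else nip[(v, u)]
        pairs += 1
    if pairs == 0:
        raise ValueError("at least two nodes are needed")
    return total / pairs


# Toy values: three pairs with 3, 2 and 2 node independent paths.
nip = {("a", "b"): 3, ("a", "c"): 2, ("c", "b"): 2}
print(average_node_connectivity(["a", "b", "c"], nip))
```

Unlike plain node connectivity, which is the minimum over all pairs, this average also
reflects pairs that are better connected than the weakest one.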
Appendix E Accuracy and limitations of the heuristics

Figure 9 shows the accuracy of the connectivity structure detected by the heuristics for all
empirical networks. In the subfigures, green bars are k-components with node connectivity ≥ k
and red bars represent k-components with node connectivity < k. Note that, once we have
an approximate structure of k-components, we can check in a reasonable time frame whether the
resulting k-components actually have node connectivity k using flow-based connectivity
algorithms (Brandes and Erlebach, 2005, chapter 7). For the candidate k-components that
turned out to have node connectivity lower than k, we used the exact algorithm proposed by
Moody and White (2003) to find out the order and size of the actual k-components inside the
candidate k-component detected using our heuristics.
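
The verification step described above can be sketched as follows. This is a minimal,
standard-library illustration of a flow-based check (node splitting with unit capacities and
breadth-first augmenting paths); it is not the code used to produce figure 9, and the function
names and the adjacency-dictionary input are assumptions of this example.

```python
from collections import deque
from itertools import combinations


def local_node_connectivity(adj, s, t):
    """Maximum number of internally node-disjoint paths between s and t."""
    cap = {}  # residual capacities of the node-split digraph

    def add_arc(a, b):
        cap.setdefault(a, {})[b] = 1
        cap.setdefault(b, {}).setdefault(a, 0)

    for x in adj:
        if x not in (s, t):
            add_arc((x, "in"), (x, "out"))   # each inner node is used once
        for y in adj[x]:
            add_arc((x, "out"), (y, "in"))   # covers both edge directions

    source, sink = (s, "out"), (t, "in")
    flow = 0
    while True:
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:  # BFS for an augmenting path
            a = queue.popleft()
            for b, residual in cap.get(a, {}).items():
                if residual > 0 and b not in parent:
                    parent[b] = a
                    queue.append(b)
        if sink not in parent:
            return flow
        b = sink
        while parent[b] is not None:         # push one unit of flow
            a = parent[b]
            cap[a][b] -= 1
            cap[b][a] += 1
            b = a
        flow += 1


def node_connectivity(adj):
    """Node connectivity of an undirected simple graph (0 if disconnected)."""
    nodes = list(adj)
    if len(nodes) < 2:
        return 0
    best = len(nodes) - 1                    # value for a complete graph
    for u, v in combinations(nodes, 2):
        if v not in adj[u]:                  # only non-adjacent pairs matter
            best = min(best, local_node_connectivity(adj, u, v))
    return best


def is_k_component(adj, candidate, k):
    """Check that the subgraph induced by candidate has connectivity >= k."""
    keep = set(candidate)
    induced = {u: [v for v in adj[u] if v in keep] for u in keep}
    return node_connectivity(induced) >= k


# A 4-cycle with a pendant node attached to node 3.
G = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0, 4], 4: [3]}
print(is_k_component(G, [0, 1, 2, 3], 2), is_k_component(G, [0, 1, 2, 3, 4], 2))
```

In the example, the cycle passes the check for k = 2, while the candidate that also includes
the pendant node fails it, so it would be plotted as a red bar.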

The output of our heuristics is an approximation to k-components based on computing
extra-cohesive blocks for each biconnected component of all core levels of the network. Recall
that in k-components all k node independent paths go through nodes that belong to the
k-component, but in extra-cohesive blocks some of the node independent paths may go through
external nodes. Thus, there is no guarantee that the extra-cohesive blocks, even those that also
form a k-core subgraph in G, have node connectivity κ = k. This is a source of false positives
for the approximation of the k-component structure of a network. However, the results shown in
figure 9 suggest that the heuristics yield a good approximation of the actual k-component based
cohesion structure of empirical networks.

If we consider all components of all sizes, as in figure 9, only a few of the extra-cohesive
blocks detected by the heuristics have node connectivity of less than k, ranging from 6.5% (a
single component) in the case of Debian to 1.2% of the components in the case of the two-mode
Nuclear Theory network. However, the extra-cohesive blocks that do not have sufficient
connectivity to be considered a k-component are, in the empirical networks analyzed, big
components of levels {3, 4}. This is because, in such big and low-level components, a few node
independent paths going through nodes that are part of the biconnected component of a k-core
but not part of the k-component can yield false positives by including nodes that should not be
part of the k-component.

However, these false positives are actually part of an extra-cohesive block, which maintains
most of the properties (in terms of robustness, hierarchy and overlap) that make the
k-component such a good measure of structural cohesion. This relaxed definition of connectivity
might be sufficient in many cases; for instance, if we are interested in comparing the structural
cohesion of a large network with a suitable null model, we may not need the exact k-component
structure because we can meaningfully compare the relaxed connectivity structure of the actual
network with its random counterparts. However, imagine we are interested in the exact
k-component structure of a particular network because, say, we want to statistically analyze the
impact of the connectivity level on the performance of different actors in a network. In this
case, we would need to apply some cutting procedure to the extra-cohesive blocks that actually
have a node connectivity of less than k.

It is more difficult to assess the impact of false negatives (that is, nodes that should be part
of a k-component but are excluded) because computing exact k-components for big networks
is not practical, and thus we cannot compare. False negatives are derived from the
underestimation of local node connectivity by the White and Newman (2001) algorithm, which
provides a strict lower bound for the local node connectivity. Thus, by using it we can miss an
edge in the auxiliary graph H that should be there. Therefore, a node belonging to a k-component
could be excluded by the algorithm. Recall that, in order to address this problem, we relaxed the
clique criterion by setting a density threshold of 0.95 in H_candidate. Whilst this value has
worked well in our analysis, careful experimentation should be performed to set this parameter
in other types of networks.
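
The origin of the underestimation can be illustrated with a simplified greedy procedure in the
spirit of the White and Newman (2001) approximation: repeatedly find a shortest path between
the two nodes, mark its internal nodes as used, and count how many paths are found. This
sketch is our own simplification for illustration, not the exact algorithm used in the
implementation, and all names in it are assumptions of this example.

```python
from collections import deque


def _shortest_path(adj, source, target, excluded, skip_direct):
    """Breadth-first shortest path that avoids the excluded nodes."""
    parent = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in adj[node]:
            if nbr in parent or nbr in excluded:
                continue
            if skip_direct and node == source and nbr == target:
                continue                  # the direct edge is counted once
            parent[nbr] = node
            if nbr == target:
                path = [nbr]
                while parent[path[-1]] is not None:
                    path.append(parent[path[-1]])
                return path[::-1]
            queue.append(nbr)
    return None


def approx_local_node_connectivity(adj, source, target):
    """Greedy lower bound on the number of node independent paths."""
    used = set()                          # internal nodes of earlier paths
    direct = target in adj[source]
    count = 1 if direct else 0            # an edge is itself one path
    while True:
        path = _shortest_path(adj, source, target, used, direct)
        if path is None:
            return count
        used.update(path[1:-1])
        count += 1


# Two node independent paths exist between s and t (s-a-b1-b2-t and
# s-c1-c2-d-t), but the shortest path s-a-d-t uses one node of each.
G = {
    "s": ["a", "c1"], "a": ["s", "b1", "d"], "b1": ["a", "b2"],
    "b2": ["b1", "t"], "c1": ["s", "c2"], "c2": ["c1", "d"],
    "d": ["c2", "a", "t"], "t": ["b2", "d"],
}
print(approx_local_node_connectivity(G, "s", "t"))
```

Here the greedy procedure finds a single path although two node independent paths exist. An
underestimate of this kind is what can add a spurious entry to the complement structure of H
and, in turn, exclude a node from a k-component.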
(a) Bipartite network formed by developers and packages over 2 years of collaboration (from 2007
to 2009) on the release codenamed Lenny of the Debian operating system

(b) Unipartite network formed by developers over 2 years of collaboration (from 2007 to 2009) on
the release codenamed Lenny of the Debian operating system

(c) Bipartite network formed by scientists and preprints during a five-year period (2006-2010) in
the high energy physics (theory) section of arXiv.org

(d) Unipartite network formed by scientists during a five-year period (2006-2010) in the high
energy physics (theory) section of arXiv.org

(e) Bipartite network formed by scientists and preprints during a five-year period (2006-2010) in
the nuclear physics (theory) section of arXiv.org

(f) Unipartite network formed by scientists during a five-year period (2006-2010) in the nuclear
theory section of arXiv.org

Figure 9: Accuracy barplots. Green bars are k-components with node connectivity ≥ k and red
bars represent k-components with node connectivity < k.